Add support for inlined compressed data (minimal changes) #202
// A list of acceptable encodings to use for inlined data. Must be
// `IDENTITY` (or unspecified), or one or more of the compressors
// supported by the server.
I think that this phrasing is far too strict. Why does it need to contain compressors supported by the server? Couldn't the server just ignore the ones it doesn't support?
Furthermore, what's the point in making IDENTITY explicit in this list? I would do it the other way around: forbid IDENTITY from being listed explicitly, allowing servers to always pick this compressor for responses.
s/Must/Should/ sounds wise. Note this is just a quick first pass, for comparison with #203.
If a client specifies IDENTITY here, the server knows it's not an old client, and can inline uncompressed data in the Blob instead of in the legacy field.
But this PR doesn't contain Blob.
You're right, this is probably something that can be dropped from this option.
+1 for keeping this simple (and same for other examples). "A list of acceptable encodings to use for inlined data. IDENTITY is always allowed and does not need to be specified explicitly."?
// Whether compression of inlined data is also supported (if so, the
// server must support each of the compressors specified in the
// `supported_compressor` field).
bool inlined_compressed_blobs = 7;
Relatedly, I also think that this phrasing is far too strict. With the repeated field that you're adding, you've already introduced a two-way method for negotiating compression algorithms on a per-request basis.
Do we even need this flag? Why not always allow clients to just provide a list of compression functions, regardless of whether the server does anything with it?
Without this, I think we'd only have negotiation for downloading inlined blobs. Clients need to know if the new inlined data format is acceptable to the server for UpdateActionResult or BatchUpdateBlobs.
If we don't require that all of the advertised bytestream compressors are allowed for inlined data, then the server needs a good way to tell the client that a given compressor isn't accepted for uploads. We could alternatively specify a list of acceptable inlined blob upload compressors.
What's the use of uploading inlined blobs? Can't you just run multiple uploads in parallel?
That's where download and upload differ: in the case of uploading you already know what the shape of the entire Merkle tree is.
It is possible to upload blobs with bytestream in parallel, but it's often simpler to do it in a single RPC call, and batching small requests together can reduce certain overheads (it's a balancing act as always).
Have you got numbers on what the overhead is?
Updated the comment. This only applies to client-sent data since we have allowed the server to always return uncompressed data if it chooses to. It would be toothless to state here that servers must also support this for downloads.
If we want to allow a different set of compressors for uploads and downloads, we could use a second repeated Compressor.Value field instead of a simple bool field. But it feels unlikely to me that a server would want this flexibility.
Jumping back to @EdSchouten's question...
What's the use of uploading inlined blobs?
This (small) PR doesn't add support for inlining blobs in new places. It just allows compressed inlined data where we previously only allowed uncompressed inlined data. Which IMO is a net benefit for servers that choose to support inlined data for uploads.
This (small) PR doesn't add support for inlining blobs in new places. It just allows compressed inlined data where we previously only allowed uncompressed inlined data.
I get that. But there's also protocol evolution, right? Does it make sense to extend features that would in retrospect be considered redundant/unnecessary? If we're adding support for compressed uploads in BatchUpdateBlobs(), does it even make sense to invest in extending UpdateActionResult() to support it?
I'm not proposing that we remove support for inlining data into UpdateActionResult(). I'm merely saying: why put jewellery on it?
IMO this is a useful improvement for UpdateActionResult rather than something we would want to discourage. On the server side it is probably not much harder to implement this for both messages than for only BatchUpdateBlobs.
I wouldn't be opposed to advertising the capabilities of each RPC call separately, but I'm not sure if that would help any server implementations in practice.
Upon re-reading the API, I'm actually leaning towards Ed's suggestion of prohibiting compressed blob inlining in UpdateActionResult. While it's hopefully not that hard, it is a new set of fields and whatnot to support compression on (clients don't set stdout/stderr fields, output contents, etc. anywhere else; servers don't usually read them), and allowing it would force servers to support rewriting compressed fields on the way out (e.g. if one client uploads an action result with inlined compressed data, and a second client calls GetActionResult and does not specify compression being allowed).
Of course, if anyone actually cares to have support of compressed+inlined uploads, that's a different story entirely. But right now it sounds like it's unmotivated by concrete use-cases, and mostly desirable for symmetry? Which I care about as well, though perhaps the tax in this case outweighs it.
Don't servers already need to support modifying update ActionResult messages though? There's nothing preventing clients and workers from uploading ActionResult messages with inlined uncompressed data, and servers must not inline this data unless requested by the client in response to GetActionResult messages.
The strategy I implemented in bazel-remote was to accept ActionResult messages with possibly inlined data, store any inlined blobs separately in the CAS since they may be requested separately, de-inline them from the ActionResult before storing that in the ActionCache, and then return that possibly modified ActionResult.
This allows clients and workers to upload the results of executing an action in a single RPC call in the best case. If we limit compressed inlined blob support to BatchUpdateBlobs for uploads, then the client/worker needs to make at least two RPC calls if you want to use compression (but not gRPC level compression). Even if they're not sequential it's simpler to do this in a single RPC.
LGTM in general - two outstanding comment threads to resolve, but I expect to be fine with whatever exact text you settle on, nothing controversial there.
// Whether compression of inlined data is also supported (if so, the
// server must support each of the compressors specified in the
// `supported_compressor` field).
bool inlined_compressed_blobs = 7;
For most use-cases it doesn't matter, but when uploading lots of small blobs it can be very material. With e.g. 1kb blobs, you're likely to be parallelism bound (100 RPCs in flight / connection -> 1000 blobs/second/connection or so); batch uploads are then very necessary if you've got thousands of blobs to upload.
For the field comment: should this apply to all inline data, or just be scoped to client-sent data? I.e. I'm wondering about changing it to "Whether compression of client-sent inlined data is also supported". That'd allow a server to support returning compressed data (via client-specified acceptable_compressors) without having to support accepting it.
Though...maybe that's a bad thing, and we should keep the existing semantics so as to introduce less variation in servers.
// A list of acceptable encodings to use for inlined data. Must be
// `IDENTITY` (or unspecified), or one or more of the compressors
// supported by the server.
Compressor.Value compressor = 9;
This should be a repeated field, named "acceptable_compressors" for consistency.
I can no longer respond to the thread above, so here it goes:
The spec is a bit incomplete here. There is nothing in the spec that states that clients even MAY/SHOULD/${VERB} inline data as part of GetActionResult. For example, take the description of stdout_raw:

// The standard output buffer of the action. The server SHOULD NOT inline
// stdout unless requested by the client in the
// [GetActionResultRequest][build.bazel.remote.execution.v2.GetActionResultRequest]
// message. The server MAY omit inlining, even if requested, and MUST do so if inlining
// would cause the response to exceed message size limits.
bytes stdout_raw = 5;

Notice how the entire paragraph discusses the semantics of GetActionResult, but there is not a single mention of "upload" or "UpdateActionResult". There are also boolean flags as part of GetActionResultRequest to control what may be inlined, but there are no corresponding boolean flags in Capabilities that control this feature in the upload case. This raises the question: was inlining stdout/stderr as part of UpdateActionResult even considered in the first place? Buildbarn, for example, tolerates it, but then goes ahead and stores the ActionResult in literal form. This means that if a client later on requests the ActionResult with inline_stdout/stderr set to false, Buildbarn goes out of spec: it will return stdout/stderr inline.

As far as I know, you are interested in the remote caching case; not remote execution. Let's take a step back and look at what you're trying to achieve. Consider this workflow:
So that's at least three RPCs: GetActionResult(), FindMissingBlobs() and UpdateActionResult(). At most, it's four RPCs: all of the previous ones and BatchUpdateBlobs(), in case one or more outputs were missing.

My proposal is to not add support for compressed uploads to UpdateActionResult(). Instead, you make the stdout and stderr part of the FindMissingBlobs() and BatchUpdateBlobs() calls and provide the digests to UpdateActionResult(). This keeps the protocol lean and immediately acts as a soft-deprecation of inline uploads.

The only downside of this approach is that you may need to perform four RPCs more frequently, namely in the case where stdout or stderr were reported missing, but all other outputs were present. My assumption is that this only happens rarely. Most actions don't even generate stdout/stderr. And even if this is not rare, there is so much more opportunity here. With very little effort, we can extend the Remote Execution protocol to always allow you to complete such actions with three RPCs. Here's the diff (lacking capabilities, etc.):

Another bonus of this change is that it adds batching support to the AC. Extend BatchReadBlobs() similarly and you will cut down the number of RPCs tremendously. Once done, we can soft-deprecate GetActionResult() and UpdateActionResult() as well.

So from a Pareto point of view, it's absolutely not evident to me why you want to add support for inlined compressed data to UpdateActionResult(). You are trying to add a feature to a part of the spec that is already a dead-end street.
I'm working both on cache-only and remote execution scenarios (execution workers have a similar process to cache clients). I have a few cases where I use a specialised cache client for actions whose outputs are reusable, but the outputs are very unreproducible, so it's not worth calling FindMissingBlobs. In the best case (when all the data fits inline) it's 1 RPC call for a cache hit (GetActionResult), and 2 RPC calls for a cache miss (GetActionResult, UpdateActionResult). If the data doesn't all fit, then the client has to make more RPC calls to transfer the leftover blobs (hence why adding support for compression here would be useful to me, at least).

I guess I have been using an underspecified corner of the spec then, with clients/workers inlining data where it's possible and not forbidden. Should we continue to allow that (and possibly add support for compressed data), or recommend against it? Or prohibit it?

@EdSchouten: do you want to open another PR for your idea? It's a larger change, but I think it would cover my use cases.
I'm personally fine with prohibiting it. My only concern is that someone else will show up within the next couple of days and call me crazy. In case that happens, I would at least suggest that we recommend against it. @EricBurnett, what are your thoughts?
I'd say, let's first work towards adding compression support to BatchUpdateBlobs() as part of this PR, while leaving UpdateActionResult() as is. Once that lands, merging the functionality of the latter into the former is a logical extension.
Filed #206 to consider client-inlining separately, so it isn't lost in this discussion.
// [BatchUpdateBlobsRequest][build.bazel.remote.execution.v2.BatchUpdateBlobsRequest],
// [ActionResult][build.bazel.remote.execution.v2.ActionResult] and
// [OutputFile][build.bazel.remote.execution.v2.OutputFile].
repeated Compressor.Value supported_inline_compressors = 7;
supported_upload_compressors, considering that we're no longer doing inlining for uploads?
I think it might be better to be more specific than supported_upload_compressors. How about supported_batch_compressors? Even though it's only strictly required to know in order to use compressed data in BatchUpdateBlobsRequest, the same values will probably be the ones selected from for BatchReadBlobsResponse.

And what do you think about renaming supported_compressors to supported_bytestream_compressors?
Rebased this based on feedback in the meeting earlier this week, and limited it to the batch APIs.
Here's a sample implementation for bazel-remote: buchgr/bazel-remote#459
LGTM - one stale comment to correct, but otherwise looks good; the main outstanding point about inline uploaded data has been mooted by #206.
I'll wait till Ed signs off (and/or has additional comments) before merging this though.
@@ -1510,6 +1510,11 @@ message BatchUpdateBlobsRequest {

// The raw binary data.
bytes data = 2;

// The format of `data`. Must be `IDENTITY` (or unspecified), or one of the
// compressors advertised by the `CacheCapabilities.supported_compressor` if
supported_compressor -> supported_batch_compressors, and I think inlined_compressed_blobs is no longer a field?
Fixed.
// Compressors supported for inlined data in
// [BatchUpdateBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.BatchUpdateBlobs]
// and
// [BatchReadBlobs][build.bazel.remote.execution.v2.ContentAddressableStorage.BatchReadBlobs]
To me it's still not clear why we want to mention BatchReadBlobs here. In the case of batch reading, this feature is already self-negotiating.
To clarify why I care: Buildbarn has a feature that allows you to integrate foreign corpora into the CAS (using a feature called Indirect Content Addressable Storage, or ICAS for short). It is not unrealistic to integrate a corpus that is already compressed using algorithm A, even though the rest of the CAS is already compressed using algorithm B. This doesn't mean that it's appreciated if you upload new blobs using algorithm A. In other words, the set of compression algorithms for uploading and downloading may differ.
BatchReadBlobs has basic negotiation: the client will never get blobs in a compressed encoding that it didn't ask for. But by itself it doesn't allow the client to prioritise compressors.
My thought here was that if the server advertises the supported compressors ahead of time, then the client can pick its favourite from the advertised list and make a request for that. I suspect that most servers which implement this will support the same set of compressors for BatchReadBlobs and BatchUpdateBlobs, and we could therefore use the same field to cover both. Maybe it would be better to use separate fields though.
Alternatively, maybe we could just state that BatchReadBlobsRequest.acceptable_compressors are specified in the client's order of preference?
I’m a fan of the latter.
WFM, done. Renamed the capabilities field to supported_batch_update_compressors.
// A list of acceptable encodings to use for inlined data, in order of
// preference from most to least preferred. `IDENTITY` is always allowed
// even if not specified here.
So based on this phrasing, if I say this:
["ZSTD", "DEFLATE", "IDENTITY"]
Then that would be the same thing as this:
["ZSTD", "DEFLATE"]
That makes sense. But what about this?
["ZSTD", "IDENTITY", "DEFLATE"]
Considering that this list is in order of preference, and that IDENTITY is always allowed, then would that be the same as the client providing this?
["ZSTD"]
I think we should explicitly forbid IDENTITY from being part of this. It only leads to confusion.
The server is not required to select the client's highest supported priority, so ["ZSTD", "IDENTITY", "DEFLATE"] is not the same as ["ZSTD"].
Is that documented anywhere?
Not currently.
How about replacing the last sentence with: "Servers MAY select any of the specified compressors or IDENTITY (even if it is not specified by the client), regardless of the client's order of preference."
I don't know. Is it desirable to make the semantics as weak as this? If I'm the administrator of a build cluster, I think I'd want to be able to have a strong influence on what compression algorithm clients use. I'd get annoyed if the facilities provided by the protocol are effectively 'best effort': they give me a way to specify which compression algorithm is preferred, but then some client deviates from that because it thinks it's smarter than me.
Regardless of us deciding whether the client or server's preference dominates, I do think that we should document a deterministic and logical algorithm for clients and servers to agree on a compression algorithm.
I suspect there's no perfect algorithm here, and that we will need to live with some level of weakness. The client has the advantage of knowing the formats the server supports (either advertised by the capabilities API or by probing), and can try to force the server to pick the client's preference by making a request that only allows that format (or fall back to uncompressed data).
That's not great for server admins, because the server potentially has more information that it could exploit, e.g. it already has the blob in a particular encoding, or it knows that a particular encoding is more efficient, or it knows that CPU usage is high and doesn't want to reencode right now.
In practice I expect most implementations to only support a few compression formats, and that should limit the problem somewhat.
Based on the discussion in the meeting today, updated to:
"A list of acceptable encodings for the returned inlined data, in no particular order. IDENTITY is always allowed even if not specified here."
@EdSchouten: how does this look now?
This is a small API change which allows for inlined data to be in compressed form in BatchReadBlobs and BatchUpdateBlobs calls. Refers to bazelbuild#201.
This is a small API change which allows for inlined data to be compressed.
Refers to #201.