-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Servers must provide empty blobs that weren't uploaded #131
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion this is the worst outcome that could have come out of this discussion. It adds unnecessary complexity to a system that would otherwise have been completely regular. To me it is also not evident how this leads to any meaningful increase in performance. Any client with some level of smartness would only download the empty blob once, due to it also using CAS-based storage locally. Because CAS blobs are (generally) only uploaded after calling FindMissingBlobs(), it would also save just one upload. It also doesn't reduce the total number of FindMissingBlobs() calls, as it is extremely unlikely that FindMissingBlobs() is called on just the empty blob.
Smarter ways to solve this would either have been to teach Bazel to upload empty blobs, or to simply disallow Digest messages to exist with size_bytes <= 0
. References to empty objects could use null Digests. The latter would solve the problem in a completely obvious, non-ambiguous way, while also reducing the size of, most notably, ActionResult.
That said, I suspect that I am in a minority here, so let's just get this over with.
Note that I'm documenting existing behaviour for compatibility reasons, not advocating for the design.
Agreed- I initially tried this approach with bazel-remote based on the spec, but had to resort to this behaviour in order to support bazel. Something to improve for REAPIv3 (since we can't do it without breaking existing clients)? |
Some clients refer to empty blobs without previously uploading them, as a performance optimization. Servers must support this. Perhaps we can improve on this for REAPIv3, eg require that empty blobs be represented by null Digests.
af403fc
to
26c67c2
Compare
FWIW, I agree with @EdSchouten. Seems backwards to change the spec to be more complex, just because one client (bazel) is not conforming to the existing spec. Why not fix this on bazel side instead? |
IIUC there are many bazel projects still using older versions of bazel, so even if we could convince the bazel team to accept such a change we would probably suffer from incompatibility problems indefinitely. |
A few reasons.
|
Similarly, we independently implemented this optimisation in another client (Pants) and another server (Scoot) because we found it both simplified implementation, and had noticeable performance effects. |
We had a release blocker because of this special case (thanks to the person who started this thread for us finding it ;-), and it also seems to increase server-side complexity rather than decrease. A server can still special-case this even if it isn't in the spec, so I think the primary question is whether it improves the client or not. I don't immediately see the client-side value - the client already has to keep track of all the other things, and adding one more doesn't seem to make a big difference. |
The reason we didn't see it before is that it only caused an issue for us if the client attempts to run an action consuming an empty file as an input on a fresh cluster. The empty output is so common that it's basically guaranteed to be there after a few actions. |
What I've observed is that this optimization is just a drop in the ocean compared to what may actually be done to improve performance. Right now Bazel makes absolutely no attempt to cache knowledge of which blobs exist on the server. FindMissingBlobs() calls tend to be huge, relative to what is actually meaningful to request at that point in time. For a workload of mine I have observed that if Bazel were to just cache blob existence info for a single minute (which seems very fair), I can obtain a >99.8% hit rate for blobs that ended up existing. Such an optimization has a lot more impact than distinguishing between the empty blob. |
To put numbers to this, I checked and our special-casing of the empty blob is approximately 20 LoC in our server in total, not all of which are strictly required (we do IsEmpty() checks at various layers, sometimes redundantly). These are pretty boring lines of code, e.g. for ByteStream.Read the entire special-case looks like I can't quantify the savings we get from knowing that our storage should never legitimately see 0-byte blobs, so let's just ignore that and say that our anecdotal evidence is it's at worst +20 LoC for us.
Certainly bazel is super dumb about blob existence checks today, and so the presence or absence of the empty blob may not be material there. (Noting the fact that bazel exists in the wild with this assumption already, so this point is also somewhat moot.) Bot side it's a little different though, in that most actions output 2-3 blobs, one of which is usually the empty blob, and so having to upload it is in the realm of +50-100% of blob uploads. Which might be negligible (if all blobs fit in a batch upload, it's a few bytes in an existing RPC); and might not be (e.g. if there's one other file and it's >4MB, we'd be making an additional RPC purely for uploading the empty blob). Though I should also note that we don't require a REAPI spec change to be able to unblock our bots on this, so this is only an argument for why it's useful, not why it's required. |
I agree with @scele and @EdSchouten ; if this is supposed to be a generic API, it should be driven by what makes for a clean API rather than "what does Bazel do". I don't strongly disagree in that this isn't hard to implement on the server side, but as an optimisation it seems clearly better to be implemented on the client side, where a client conforming to the current API could simply optimise away some requests. Why isn't a SHOULD on the client side enough? |
Personally I'm fine with including the MUST language for the server side for v2. I think we can revisit in v3. I am not convinced we would end up in a different place. |
Wouldn't my proposal to use null digests for empty objects even be worth considering for a v3? |
@EdSchouten I think it certainly would be consider. Personally, I am just thinking that that may be more complexity for clients and servers than the current requirement for a server to always have the empty blob available. |
You’d be amazed, but that’s often not the case. For certain classes of programming languages, null Protobuf messages are mapped directly to null pointers, option types with value ‘None’, etc. In those programming languages you already need to do null pointer comparisons or pattern matching whenever you access a Digest that’s stored in a message, as it could be null right now. These comparisons could be used to skip loading empty blobs. This is what I meant with that using null Protobuf messages for this is more obvious and not ambiguous. |
I don't feel terribly strongly on this issue and will be fine either way,
but I also cast my vote for including the MUST language, on the grounds
that it's quite simple (and generally required) server-side, and materially
simplifies client-side empty-blob optimizations.
@EdSchouten <https://github.com/EdSchouten> I'm not sure I understand the
connection to using null Digests to represent the empty blob? Seems
somewhat orthogonal to me, in that it's mostly a question of whether Fetch
or FindMissingBlobs can ever fail on the empty blob, regardless of how we
describe "the empty blob". I think I'm missing some context here? At any
rate, I think your proposal to change how we describe the empty blob is
worth considering for V3, but I had thought it independent of this
discussion.
…On Fri, Apr 17, 2020 at 5:29 PM Ed Schouten ***@***.***> wrote:
@EdSchouten <https://github.com/EdSchouten> I think it certainly would be
consider. Personally, I am just thinking that that may be *more*
complexity for clients and servers than the current requirement for a
server to always have the empty blob available.
You’d be amazed, but that’s often not the case. For certain classes of
programming languages, null Protobuf messages are mapped directly to null
pointers, option types with value ‘None’, etc. In those programming
languages you already need to do null pointer comparisons or pattern
matching whenever you access a Digest that’s stored in a message, as it
could be null right now. These comparisons could be used to skip loading
empty blobs.
This is what I meant with that using null Protobuf messages for this
obvious and not ambiguous.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#131 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABREW3B22DBVJ65BFS5KQTRNDC4ZANCNFSM4MJHDL6A>
.
|
I think we have a slightly better chance of reducing interoperability problems if we say "MUST"- so this gets my vote. But I'm also OK with "SHOULD". Re improving this for REAPIv3, should we move that discussion somewhere else? (If so, where? Are there any other v3 changes already being discussed?) |
Re improving this for REAPIv3, should we move that discussion somewhere
else? (If so, where? Are there any other v3 changes already being
discussed?)
There are a few other desired changes pushed to V3, yes. @bergsieker has
started collecting them into
https://github.com/bazelbuild/remote-apis/milestone/1 , and I just filed
#132 to collect another
that has been discussed but doesn't have its own bug yet. I would suggest
filing a separate Issue for the V3 portion of the discussion and we can get
it tagged appropriately.
…On Sun, Apr 19, 2020 at 6:36 PM mostynb ***@***.***> wrote:
I think we have a slightly better chance of reducing interoperability
problems if we say "MUST"- so this gets my vote. But I'm also OK with
"SHOULD".
Re improving this for REAPIv3, should we move that discussion somewhere
else? (If so, where? Are there any other v3 changes already being
discussed?)
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#131 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABREWZ4VG3ZYIWM57OTFV3RNN4FXANCNFSM4MJHDL6A>
.
|
This was made the default in bazelbuild/continuous-integration#957.
Some clients refer to empty blobs without previously uploading them, as a performance optimization. Servers must support this.