BuckEvent: provide a Buck Event Publisher proto #685
Force-pushed from f2816de to 3925a3c.
Provide a BuckEvent Publisher service which emits BuckEvents to an external server implementation. Closes facebook#226
Force-pushed from 3925a3c to 3572e6c.
@cjhopman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Copying some comments left on phabricator:
I think we need the rest of the stack to accept this.
It is not clear how this will be used.
(Also I'm not 100% sure grpc is the right choice, maybe simple piping to stdin would be sufficient).
I think I agree that I'd like to see a little more code before landing this just yet
message BuckEventResponse {
  // A trace-unique 64-bit integer identifying the stream.
  uint64 stream_id = 1;
Instead of having stream_id, we can just create a new gRPC call for each stream; that would be more natural.
Actually, to partially respond to Stiopa on this: that seems hard to implement practically on the buck2 side, since we may create a number of different instances of the client for each command. And even if we didn't do that, it seems unwise to prevent ourselves from doing that in the future.
Actually, I have another thing to consider: internally, we've found that the size of the event logs can be extremely substantial, and keeping the entire mechanism reliable hasn't been super straightforward. Now, while I don't know if anyone else's repo is big enough to be affected by that, and I'm sure that some of the unreliability is a result of misbehaviors specific to our infra, I can also imagine that some of the lessons learned from that are more generally useful.

To go into detail a bit, the way our scribe client works is roughly that it allocates the "real" client as a global, and gets a handle to it for each instance of the type you see. The global then keeps a queue of messages that is used to 1) batch writes, and 2) rate limit itself. While the rate limiting is probably not necessary in OSS, the write batching might be of interest. That might be something to consider for this API. Hopefully that information also explains what…
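For readers outside Meta, a minimal Rust sketch of the queue-backed global client pattern described above might look like the following; the names (`GlobalSink`, `Event`, `take_batch`) and structure are illustrative assumptions for this example, not buck2's actual scribe client:

```rust
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical stand-in for a serialized BuckEvent.
#[derive(Clone)]
struct Event {
    payload: Vec<u8>,
}

// The "real" client lives behind a process-wide global; each per-command handle
// just pushes into its queue.
struct GlobalSink {
    queue: Mutex<Vec<Event>>,
}

static SINK: OnceLock<Arc<GlobalSink>> = OnceLock::new();

impl GlobalSink {
    fn get() -> Arc<GlobalSink> {
        SINK.get_or_init(|| Arc::new(GlobalSink { queue: Mutex::new(Vec::new()) }))
            .clone()
    }

    // Cheap, non-blocking enqueue used on the hot path.
    fn enqueue(&self, event: Event) {
        self.queue.lock().unwrap().push(event);
    }

    // A background task drains everything accumulated so far and writes it as a
    // single batch; rate limiting could also be applied at this point.
    fn take_batch(&self) -> Vec<Event> {
        std::mem::take(&mut *self.queue.lock().unwrap())
    }
}

fn main() {
    let sink = GlobalSink::get();
    sink.enqueue(Event { payload: b"example".to_vec() });
    let batch = sink.take_batch();
    println!("flushing {} event(s) in one batch", batch.len());
}
```

Under this arrangement the hot path only ever pays for a queue push; a timer-driven background task would call `take_batch` and perform the actual network write.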
Summary: This won't be able to build in OSS, but we can at least make the code available. Of interest specifically to #685

Reviewed By: ndmitchell
Differential Revision: D59262895
fbshipit-source-id: e4e72a402e174dedd08715a627a84e7b16d1a225
There seem to be 2 concerns here:
I plan to mimic Bazel's approach here. In Bazel, each "invocation" gets a unique stream ID. As events get sent to the server, the server replies with an "ack" (identified via trace_id) for events that have been persisted. Network disruption could happen, and in such cases the client might want to retry sending events that were not acked by the server. Upon resend, the events could be sent to a different server instance, thus the stream ID is what lets the server associate the retried events with the same invocation.
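To make the retry bookkeeping concrete, a rough Rust sketch of the client side under that scheme could look like this; keying pending events by a per-event id, and the names `PendingEvents` and `to_resend`, are assumptions for the example rather than anything in the proposed proto:

```rust
use std::collections::HashMap;

// Illustrative client-side bookkeeping for the ack/retry scheme sketched above.
struct PendingEvents {
    stream_id: u64,
    pending: HashMap<u64, Vec<u8>>, // hypothetical event id -> serialized event
}

impl PendingEvents {
    fn new(stream_id: u64) -> Self {
        Self { stream_id, pending: HashMap::new() }
    }

    // Remember an event when it is first sent.
    fn sent(&mut self, event_id: u64, event: Vec<u8>) {
        self.pending.insert(event_id, event);
    }

    // The server acknowledged that this event was persisted; stop tracking it.
    fn acked(&mut self, event_id: u64) {
        self.pending.remove(&event_id);
    }

    // After a disconnect, everything still pending is resent on a new
    // connection that carries the same stream_id.
    fn to_resend(&self) -> impl Iterator<Item = (&u64, &Vec<u8>)> + '_ {
        self.pending.iter()
    }
}

fn main() {
    let mut tracker = PendingEvents::new(42);
    tracker.sent(1, b"event-1".to_vec());
    tracker.sent(2, b"event-2".to_vec());
    tracker.acked(1);
    // Only event 2 would be resent under stream_id 42.
    println!("stream {}: {} pending", tracker.stream_id, tracker.to_resend().count());
}
```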
I am not sure why batching is needed. Multiple gRPC streams can multiplex over a single HTTP/2 connection, so the overhead of sending events one by one is relatively low. I could see a potential benefit in compressing multiple events together to send as a batch, but I would argue that this should be left to the server implementation to optimize (i.e. some implementations could provide a local forwarding proxy that handles the batching). In Bazel, it's better to stream the events as soon as they arrive, as some events contain the console log information, which enables the server to implement an alternative build UI on their website. I could imagine that we might be able to extend BuckEvent to support this use case (aka the removed ControlEvent). Could you please elaborate a bit more on why batch sending needs to be implemented on the client side? Perhaps you are thinking of potential non-gRPC client implementations?
From my perspective, it could potentially be a huge time investment to code the rest of the implementation without agreeing on the spec first. At least some agreement on the gRPC service would be a positive signal for me to put more time into this. The current design is modeled after Bazel's Build Event Service.

With that said, I do understand that this is going to be a Buck2-specific API. So if there are any alternative proposals, I would love to hear them.
So the thing to keep in mind is that our event sending is done synchronously in a bunch of latency sensitive places. While a benchmark would certainly be much better than guessing, my expectation is that doing a network op for each of these events might be prohibitively slow, at least in some of our builds. That being said, I don't know that we have to figure this out now. Happy to let you experiment and see if it matters or not.
Yeah, sorry, I was a bit too brief before in just asking for more code. I think you're right to ask for alignment on use of a gRPC API in general, and I'll bring that up at our team meeting tomorrow to make sure we're all good to move forward. The main thing I would ask for from your side is that the service definition itself is clearly marked as being unstable until we've gathered some experience with using it. The way we've reported events to our own service has changed quite a lot in the past, and I imagine that much of that configuration may at some point need room in the API (the first thing that comes to mind is that we only send a subset of events right now, and I imagine not all consumers are going to agree on which subset is the right one...).

Other than that, I don't think it makes much of a difference whether we merge this PR now or wait for some additional code on the client before doing so, but hopefully the things I mentioned above are enough for you to be able to continue work on your side.
Thank you so much. Looking forward to this.
This is a good call-out. In Bazel, there is a flag that can switch between sending build events asynchronously and synchronously. The reason for synchronous is that in many OSS setups, the CI worker is an ephemeral CI container/VM that gets shut down right after the build is finished. In those scenarios, it might benefit folks to turn on synchronous event sending so that their builds wait for all events to be sent before exiting and initiating the container shutdown.

I think sending events in async/sync mode is tangential to sending events via stream/batch request, so let's not conflate the two. There are ways to implement an async event sink while using a gRPC bidi stream, as well as a batch request while supporting queue bypassing for send_now. For example, we could maintain an internal priority queue in the Buck daemon and send_now events would always get the highest priority. I think we could hash this out in future PRs when we dive into implementation details.
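As one possible illustration of that priority-queue idea (a sketch only: the send_now flag, the priority values, and all names here are assumptions, not part of this PR), the daemon-side queue could be modeled roughly like this in Rust:

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Eq, PartialEq)]
struct QueuedEvent {
    priority: u8, // 0 = send_now (highest), 1 = normal
    seq: u64,     // arrival order, used as a tiebreaker
    payload: Vec<u8>,
}

// BinaryHeap is a max-heap, so invert the comparison to pop the lowest
// (priority, seq) pair first.
impl Ord for QueuedEvent {
    fn cmp(&self, other: &Self) -> Ordering {
        (other.priority, other.seq).cmp(&(self.priority, self.seq))
    }
}
impl PartialOrd for QueuedEvent {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

struct EventQueue {
    heap: BinaryHeap<QueuedEvent>,
    next_seq: u64,
}

impl EventQueue {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    // send_now events jump ahead of regular events; otherwise FIFO order.
    fn push(&mut self, payload: Vec<u8>, send_now: bool) {
        let priority = if send_now { 0 } else { 1 };
        self.heap.push(QueuedEvent { priority, seq: self.next_seq, payload });
        self.next_seq += 1;
    }

    fn pop(&mut self) -> Option<Vec<u8>> {
        self.heap.pop().map(|e| e.payload)
    }
}

fn main() {
    let mut queue = EventQueue::new();
    queue.push(b"regular event".to_vec(), false);
    queue.push(b"urgent event".to_vec(), true);
    // The send_now event is drained first despite arriving second.
    assert_eq!(queue.pop().unwrap(), b"urgent event".to_vec());
}
```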
I have no problem with this. If there is any existing convention in Buck2 to mark an unstable service, I would be happy to follow it. If there isn't, I will pick a name for the configuration key that activates this. Let me know what you prefer here.
I think the main difference will be who has to pay the rebase cost. If the review time for future PRs is short, I have no problem leaving this unmerged. However, if the lead time is long, I will have to rebase my PR stack and account for existing refactoring (codemods) in the Buck2 repo, which could be a pain to deal with. So I would prefer to have this merged earlier rather than later.
@JakobDegen friendly ping. I am looking for a confirmation on the gRPC API direction before investing more time into this.
Alright, sorry for the delay. We had some back and forth on this internally and are ourselves not quite sure what we think the right answer is.
So we actually have the same problem, which is why, even though our event sending is asynchronous in the sense that it doesn't do a network op on each event…
👍
I think a comment at the top of the proto file would be more than enough
Yeah, that's fair. Let's merge this then and we can iterate.

I'll add one additional point that came up internally: given how big logs can sometimes be, and that users are often working from home on not-great internet (or god forbid, from a hotel or plane or something), the default state was that we had complete data for not even 90% of builds. Fixing that to get 99% complete data was a significant investment for us. The machine-local queues we currently have in a couple of different places, which allow us to bridge periods of reduced internet connectivity, were a very big contributing part of that. It's not blocking, but I wanted to at least raise awareness of that so you get a chance to consider it. If you're otherwise comfortable with this, let me know and I'll go ahead and merge it.
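For anyone picking this up later, a machine-local queue of the kind mentioned above could be approximated with a simple spool file; the path, framing, and function name below are made up purely for illustration and are not part of buck2:

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

// Append one serialized event to a local spool file so it survives network
// outages; an uploader can later re-read the file, ship its contents, and
// truncate it once the server has acked everything.
fn spool_event(path: &Path, event: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    // Length-prefix each record so the uploader can split the spool back
    // into individual events.
    file.write_all(&(event.len() as u32).to_le_bytes())?;
    file.write_all(event)?;
    Ok(())
}

fn main() -> io::Result<()> {
    spool_event(Path::new("buck-events.spool"), b"serialized event bytes")
}
```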
Related: #811 provides a draft implementation of BES support for Buck2.