RFC for integrating pcap-release with BOSH #640
@maxmoehl pls review / comment |
@plowin FYI |
Co-authored-by: Patrick Lowin <patrick.lowin@sap.com>
@beyhan we're planning to set this RFC to "ready for review" by EOB July 12. If possible, please review / add your thoughts by then. Thanks! |
Added some editorial suggestions and a sentence about gRPC over HTTP/2, which should be mentioned for full context.
Co-authored-by: Alexander Lais <Alexander.lais@me.com>
@domdom82 I think we can set it to ready for review now? |
Thanks for everyone's review and comments! Setting to ready for review now. |
@cloudfoundry/toc Here is the PR to discuss integrating the pcap functionality with BOSH. |
It feels confusing to have the "architecture we're not going to be using" as the first thing you find in the RFC. Feels like "We want to build this" would be a better structure, and maybe talk in some footnotes about "we tried this, and it was painful because of X,Y,Z", but I'm also fine if you just left all the history out since the new plan just seems fundamentally cleaner and easier. |
@jpalermo point taken. The diagram shows the "as is" state today. It should be noted that this approach works but has some inconveniences, and the integration with CF won't go away as an option. Your point is valid though; we will provide another diagram that shows what (we think) an integration with BOSH could look like. |
I agree with @jpalermo about the need to focus on the proposed solution, not on the current non-integrated architecture (which is confusing for people who do not already have context). While reading through the proposal I also looked at a similar use case that already exists in the bosh cli (log streaming), which apparently is implemented over ssh: https://github.com/cloudfoundry/bosh-cli/blob/main/cmd/logs.go#L104 Would streaming over ssh and deduplicating in the bosh cli (so integrating a bit of the server-side component in the cli) be sufficient? I can see the following benefits with this approach:
If plain tcpdump would not be enough, we could consider adding a pcap binary to the stemcells, but this will mean people will have to update their stemcell to be able to use this functionality. |
Technically that would be possible but it would create quite a gap between the implementation of the cf and bosh case. We could run
(1) Running this directly (e.g. the CLI itself opens the SSH connection to the selected VMs) would make this a completely separate implementation that only shares a very small portion of the code (the core packet-merge-logic of the API and small parts of the pcap CLI) with the existing code base.
(2) With the bosh director running pcap-api we would have to have different API implementations for each case (cf: gRPC vs. bosh: SSH) that "only" share the downstream (client-facing) API spec / implementation.
(3) Implement as proposed (pcap-API on bosh director, bosh-agent incorporates functionality from the pcap-agent).
I have to admit that (1) has quite some points going for it. Besides the points you already mentioned:
(2), on the other hand, doesn't seem very attractive to me, as the overhead of maintaining two API implementations (CF probably can't use this trick) feels too high. It also requires changes to the director VM, which raises the barrier of entry significantly compared to (1). Still, (3) has an even higher barrier of entry, since we also need to change the bosh-agent and stemcell. From a progress standpoint it's a bit unfortunate, since we invested quite some work in the BOSH case. Most of that is shared with CF, so it's not too bad, but the CF case is not ready yet, so we basically have a framework that currently has no use. But I would argue that this is mainly on us for not raising this RFC earlier, and we shouldn't consider this when choosing an option.
|
@maxmoehl @rkoster, this is an interesting idea. We have some unique features in the pcap-api (central component) that weren't described in the RFC yet. We may need to flesh out the scenarios and error resilience in more detail. Using gRPC with bidirectional streams allows, from my point of view, unique features that shine with the central dedicated component (i.e. pcap-api):
Independent of, but supported by, gRPC we also have concurrent capture limitations per client IP address. The limit is configurable and is ultimately meant to prevent abuse of the feature, whether accidental or on purpose. A final word on using SSH: The log forwarder seems to be a reasonably simple case of "collect the logs and send them on" without any interactivity. From my point of view this could be limiting for the pcap-release. Today we already have some scripts that allow automating calls to tcpdump on BOSH VMs and streaming via SSH. This approach is not particularly reliable; it is error-prone and hard to control. Based on the experiences with those scripts we set out to design pcap-release, as it addresses many of those issues. I agree with @maxmoehl's concern that adding an entirely different implementation for data exchange, i.e. replacing the gRPC mechanism with SSH or rather adding SSH in addition to gRPC, will be difficult to maintain. I also agree that we should not consider our current implementation for this RFC, but we should consider the anticipated use case for CF and the potential for reuse between those two. A single code base that is maintained, debugged and improved will benefit both use cases at the same time, instead of spreading developers even thinner on supporting two almost disparate code bases. |
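To illustrate the per-client limit mentioned above, here is a minimal sketch in Go of how a configurable cap on concurrent captures per client IP could work. It is not taken from pcap-release: the `CaptureLimiter` type and its `Acquire`/`Release` methods are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// CaptureLimiter caps the number of concurrent captures per client IP.
// Hypothetical type for illustration; not the pcap-release implementation.
type CaptureLimiter struct {
	mu     sync.Mutex
	max    int            // configurable limit per client IP
	active map[string]int // number of running captures per client IP
}

// NewCaptureLimiter returns a limiter allowing max concurrent captures per client.
func NewCaptureLimiter(max int) *CaptureLimiter {
	return &CaptureLimiter{max: max, active: make(map[string]int)}
}

// Acquire reserves a capture slot for clientIP and reports whether it succeeded.
func (l *CaptureLimiter) Acquire(clientIP string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active[clientIP] >= l.max {
		return false
	}
	l.active[clientIP]++
	return true
}

// Release frees a slot previously reserved with Acquire.
func (l *CaptureLimiter) Release(clientIP string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active[clientIP] > 0 {
		l.active[clientIP]--
	}
	if l.active[clientIP] == 0 {
		delete(l.active, clientIP) // keep the map from growing without bound
	}
}

func main() {
	limiter := NewCaptureLimiter(2)
	fmt.Println(limiter.Acquire("10.0.0.1")) // first capture for this client
	fmt.Println(limiter.Acquire("10.0.0.1")) // second capture for this client
	fmt.Println(limiter.Acquire("10.0.0.1")) // exceeds the limit of 2
	limiter.Release("10.0.0.1")
	fmt.Println(limiter.Acquire("10.0.0.1")) // a slot is free again
}
```

A check like this only makes sense in a central component that sees all requests, which is part of the argument for keeping the limit on the server side.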
@peanball I would blame that mainly on the fact that those are shell scripts which are lacking tests, proper error handling and rely on some weird hacks that would not be needed if this were implemented in a language like Go. |
@rkoster streaming via SSH technically works but it's not very performant. The main issue is that SSH is made for text and pcap is a binary stream. There are ways to encode pcap as hex-encoded text in tcpdump (-xx flag) but you'll lose the timestamp format and it's no fun parsing. The idea to do the deduplication on the client and avoid streaming via the BOSH director / API sitting on it seems appealing, though it would mean a significant deviation from the CF case, where this won't be possible for security reasons. So we would probably end up with a different code base for BOSH vs. CF for pcap. Further down the road, there will be discussions about bandwidth usage. When you capture a deployment with lots of instances with a coarse-grained filter and try to stream all of that back to the client, you'll run into a congestion problem.
You could say all of that could also be done on the CLI, true. But the security shouldn't sit on the client side imo. A lot of server-side logic would be required on the CLI to make this work. |
Hi all, a verbal explanation and demo of the current version was just advertised by @ramiyengar, feel free to listen in tomorrow!
|
If SSH is not feasible, would it be possible to reuse some of the existing blobstore infrastructure to store the packet captures? BOSH CLI (start packet capture) → director API → NATS message (capture filter + signed blobstore destination URL) → bosh agent → embedded pcap lib sends capture to blobstore + NATS message heartbeat with capture stats → back to director → BOSH CLI. This architecture is a lot more involved but is in line with existing communication paths. The pattern of a NATS message with a signed blobstore URL is, for example, also used when compiling packages. |
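To make the proposed flow more concrete, here is a sketch in Go of what the two message payloads might carry: the capture request from the director to the agent, and the heartbeat going back. All type names, field names and the example URL are assumptions made for illustration. They do not reflect the actual BOSH NATS message format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CaptureRequest is a hypothetical payload for the director → agent message.
type CaptureRequest struct {
	Filter             string `json:"filter"`               // capture filter expression
	SignedBlobstoreURL string `json:"signed_blobstore_url"` // upload destination for the capture
}

// CaptureHeartbeat is a hypothetical payload for the agent → director message.
type CaptureHeartbeat struct {
	PacketsCaptured uint64 `json:"packets_captured"`
	BytesUploaded   uint64 `json:"bytes_uploaded"`
}

// EncodeRequest serializes a capture request for transport.
func EncodeRequest(req CaptureRequest) ([]byte, error) {
	return json.Marshal(req)
}

// DecodeRequest parses a capture request on the receiving side.
func DecodeRequest(payload []byte) (CaptureRequest, error) {
	var req CaptureRequest
	err := json.Unmarshal(payload, &req)
	return req, err
}

func main() {
	payload, err := EncodeRequest(CaptureRequest{
		Filter: "tcp port 443",
		// Placeholder URL; a real one would be issued by the director.
		SignedBlobstoreURL: "https://blobstore.example.com/captures/placeholder?signature=placeholder",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))

	heartbeat, err := json.Marshal(CaptureHeartbeat{PacketsCaptured: 1200, BytesUploaded: 96000})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(heartbeat))
}
```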
@rkoster we discussed the SSH option internally and agreed that it could be done, though it would be lacking some of the "higher level" features that pcap-agent could do. The SSH solution would basically be a thin wrapper around The upside I see here:
The downside:
I'd say if the community can live with these shortcomings, the ssh approach may be more "fitting" with the BOSH ecosystem. |
Going over all the discussions and the summary provided by @domdom82 my personal favorite is the
In this comment @domdom82 raised some security concerns. I don't think that they are relevant for the |
@maxmoehl could you update the RFC to reflect the |
We are currently working on an updated version of the RFC and will share that soon. |
@rkoster, we have an update almost ready. It's being reviewed internally. Should be able to post it soon. |
@rkoster it's addressed now in the latest state. The RFC text changed significantly in terms of structuring, so please have another look. Thanks! |
@peanball thanks for the update! I read the latest version and I don't have any requests for changes. I added this RFC as a discussion topic to our meeting notes for the next FI WG meeting this Thursday, 3rd of August. If any of you would like to participate in the discussions, please join our FI WG meeting. The details are available in the CF community calendar. |
@beyhan yes, that's correct. In the "streaming only" scenario there is no security concern, because the pcaps are not stored elsewhere. That would only occur if the pcap was first buffered on an external datastore. |
As discussed in the Foundational Infrastructure Working Group meeting yesterday, I've added the note that Option 2 "pcap-lite" is the selected option. The description of the other option and the comparison of the two are retained as is for reference and historical context. |
@beyhan Motion to trigger Final Comment Period (FCP) for this RFC. I think we have reached the end of discussions for now. |
FCP should end on 15.08.2023 |
Hi all, we created an initial implementation of pcap-lite with cloudfoundry/bosh-cli#627. It's still in draft state. Preliminary feedback, especially from the BOSH and bosh-cli perspective is welcome. |
This PR adds an RFC to propose integration of pcap-release with BOSH.
For easier viewing: https://github.com/domdom82/community/blob/rfc-bosh-pcap/toc/rfc/rfc-draft-pcap-bosh.md