Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for integrating pcap-release with BOSH #640

Merged
merged 9 commits into from
Aug 15, 2023

Conversation

domdom82
Copy link
Contributor

@domdom82 domdom82 commented Jul 6, 2023

This PR adds an RFC to propose integration of pcap-release with BOSH.

For easier viewing: https://github.com/domdom82/community/blob/rfc-bosh-pcap/toc/rfc/rfc-draft-pcap-bosh.md

@domdom82
Copy link
Contributor Author

domdom82 commented Jul 6, 2023

@maxmoehl pls review / comment

@domdom82
Copy link
Contributor Author

domdom82 commented Jul 6, 2023

@plowin FYI

toc/rfc/rfc-draft-pcap-bosh.md Outdated Show resolved Hide resolved
toc/rfc/rfc-draft-pcap-bosh.md Outdated Show resolved Hide resolved
toc/rfc/rfc-draft-pcap-bosh.md Outdated Show resolved Hide resolved
toc/rfc/rfc-draft-pcap-bosh.md Outdated Show resolved Hide resolved
Co-authored-by: Patrick Lowin <patrick.lowin@sap.com>
@domdom82
Copy link
Contributor Author

@beyhan we're planning to set this RFC to "ready for review" by EOB July 12. If possible, please review / add your thoughts by then. Thanks!

Copy link
Contributor

@peanball peanball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some editorial suggestions and a sentence about gRCP over HTTP/2, which should be mentioned for full context.

toc/rfc/rfc-draft-pcap-bosh.md Outdated Show resolved Hide resolved
toc/rfc/rfc-draft-pcap-bosh.md Outdated Show resolved Hide resolved
toc/rfc/rfc-draft-pcap-bosh.md Outdated Show resolved Hide resolved
toc/rfc/rfc-draft-pcap-bosh.md Show resolved Hide resolved
Co-authored-by: Alexander Lais <Alexander.lais@me.com>
@maxmoehl
Copy link
Member

@domdom82 I think we can set it to ready for review now?

@domdom82
Copy link
Contributor Author

Thanks for everyone's review and comments! Setting to ready for review now.

@domdom82 domdom82 marked this pull request as ready for review July 12, 2023 07:56
@domdom82 domdom82 changed the title initial draft for bosh pcap rfc RFC for integrating pcap-release with BOSH Jul 12, 2023
@domdom82
Copy link
Contributor Author

@cloudfoundry/toc Here is the PR to discuss integrating the pcap functionality with BOSH.
The proposal was originally discussed with @rkoster on CF Day 2023. Feel free to chime in and discuss!

@beyhan beyhan requested review from a team, rkoster, beyhan, stephanme, ameowlia, ChrisMcGowan and jpalermo and removed request for a team July 13, 2023 08:44
@jpalermo
Copy link
Member

It feels confusing to have the "architecture we're not going to be using" as the first thing you find in the RFC.

Feels like "We want to build this" would be a better structure, and maybe talk in some footnotes about "we tried this, and it was painful because of X,Y,Z", but I'm also fine if you just left all the history out since the new plan just seems fundamentally cleaner and easier.

@domdom82
Copy link
Contributor Author

@jpalermo point taken. The diagram shows the "as is" state today. It should be noted that this way works but had some inconvenience to it and the integration with CF won't go away as an option. Your point is valid though, we will provide another diagram that shows how (we think) an integration with BOSH could look like.

@rkoster
Copy link
Contributor

rkoster commented Jul 18, 2023

I agree with @jpalermo about the need to focus on the proposed solution, not on the current non-integrated architecture (which is confusing for people who do not already have context).

While reading through the proposal I also looked at a similar use case that already exists in the bosh cli (log streaming) which apparently is implemented over ssh: https://github.com/cloudfoundry/bosh-cli/blob/main/cmd/logs.go#L104

Would streaming over ssh and deduplicating in the bosh cli (so integrating a bit of the server-side component in the cli) be sufficient? I can see the following benefits which this approach:

  1. Only need to touch the cli, people would be able to use the functionality in older environments by just updating the cli.
  2. Heavy lifting will be performed by the machine executing the cli instead of the bosh director
  3. No need to open up additional ports (ssh from director to vms, when using director as gateway should already work in most environments)

If plain tcpdump would not be enough, we could consider adding a pcap binary to the stemcells, but this will mean people will have to update their stemcell to be able to use this functionality.

@maxmoehl
Copy link
Member

maxmoehl commented Jul 18, 2023

Would streaming over ssh and deduplicating in the bosh cli (so integrating a bit of the server-side component in the cli) be sufficient?

Technically that would be possible but it would create quite a gap between the implementation of the cf and bosh case.

We could run tcpdump -w - [...] via SSH and parse the output using libpcap (wrapped in gopacket), all using the bosh CLI.

(1) Running this directly (e.g. the CLI itself opens the SSH connection to the selected VMs) would make this a completely separate implementation that only shares a very small portion of the code (the core packet-merge-logic of the API and small parts of the pcap CLI) with the existing code base.

(2) With the bosh director running pcap-api we would have to have different API implementations for each case (cf: gRPC vs. bosh: SSH) that "only" share the downstream (client-facing) API spec / implementation.

(3) Implement as proposed (pcap-API on bosh director, bosh-agent incorporates functionality from the pcap-agent).

I have to admit that (1) has quite some points going for it. Besides the points you already mentioned:

  • simple setup / architecture
  • less moving parts that can break / need maintenance

(2) on the other hand doesn't seem very attractive to me as the overhead of maintaining two API implementations (CF probably can't use this trick 1) feels too high. It also requires changes to the director VM which raises the barrier of entry significantly compared to (1), still, (3) has an even higher barrier of entry since we also need to change the bosh-agent and stemcell.

From a progress standpoint it's a bit unfortunate since we invested quite some work in the BOSH case. Most of that is shared with CF so it's not too bad but the CF case is not ready yet so we basically have a framework that, currently, has no use. But I would argue that this is mainly on us not raising this RFC earlier and we shouldn't consider this when choosing an option.

Footnotes

  1. We can't be certain that tcpdump is available, the SSH user might not have sufficient permissions, etc.

@peanball
Copy link
Contributor

@maxmoehl @rkoster, this is an interesting idea. We have some unique features in the pcap-api (central component) that weren't described in the RFC yet. We may need to flesh out the scenarios and error resilience in more detail.

Using gRPC with bidirectional streams allows, from my point of view, unique features that shine with the central dedicated component (i.e. pcap-api):

  1. Two-way communication via gRPC streams for clean shutdown of a capture, flushing all data captured and buffered up until that point.
  2. Congestion control. When streaming data to a pcap-client, pcap-api uses a limited buffer towards each of the clients. When the buffer runs full, it can shed load (drop packets) and inform the client that it did so. This is an indication for the user to e.g. refine the filter.
    The same happens between pcap-agent and pcap-api. When a single agent generates more traffic than it can send to the pcap-api (or the pcap-api can accept), it also sheds packets and sends a message that it did so. This message is passed on to the client as well.
  3. Capture Stop on Disconnect. When a client disappears, pcap-api will actively stop ongoing captures destined for that client on the pcap-agents.

Independent of, but supported by, gRPC we also have concurrent capture limitations per client IP address. The limit is configurable but ultimately should avoid abusing the feature, whether accidentally or on purpose.

A final word on using SSH: The log forwarder seems to be a reasonably simple case of "collect the logs and send them on" without any interactivity. From my point of view this could be limiting for the pcap-release.

Today we already have some scripts that allow automating calls to tcpdump on BOSH VMs and streaming via SSH. This is not particularly reliable, error prone and hard to control. Based on the experiences with those scripts we set out to design pcap-release, as it addresses many of those issues.

I agree with @maxmoehl's concern that adding an entirely different implementation for data exchange, i.e. replace the gRPC mechanism with SSH or rather add SSH in addition to gRPC, will be difficult to maintain.

I also agree that we should not consider our current implementation for this RFC, but we should consider the anticipated use case for CF and the potential for reuse between those two. A single code base that is maintained, debugged and improved will benefit both use cases at the same time, instead of spreading developers even thinner on supporting two almost disparate code bases.

@maxmoehl
Copy link
Member

Today we already have some scripts that allow automating calls to tcpdump on BOSH VMs and streaming via SSH. This is not particularly reliable, error prone and hard to control. Based on the experiences with those scripts we set out to design pcap-release, as it addresses many of those issues.

@peanball I would blame that mainly on the fact that those are shell scripts which are lacking tests, proper error handling and rely on some weird hacks that would not be needed if this were implemented in a language like Go.

@domdom82
Copy link
Contributor Author

@rkoster streaming via SSH technically works but it's not very performant. The main issue is that SSH is made for text and pcap is a binary stream. There are ways to encode pcap as hex-encoded text in tcpdump (-xx flag) but you'll lose the timestamp format and it's no fun parsing.
In fact, we have such a solution in place right now, but people ended up running tcpdump remotely and storing the capture file in /tmp and then download all of the pcaps via scp and then merge them using mergecap. Oh well.. it "works" 😄

The idea to do the deduplication on the client and avoid streaming via the BOSH director / API sitting on it seems appealing, though it would mean a significant deviation from the CF case where this won't be possible for security reasons. So we would probably end up with a different code base for BOSH vs. CF for pcap.

Further down the road, there will be discussions about bandwidth usage. When you capture a deployment with lots of instances with a coarse grained filter and try to stream all of that back to the client, you'll run into a congestion problem.
For that scenario, we've thought about a second use case for that in-between API node:

  • The PCAP agents would not stream large network dumps directly back to the client.
  • Instead they stream to a cloud storage like S3 and return the object id to the API
  • During streaming, the client only sees meta data (e.g. how many packets have been captured, how much data has been transferred etc.)
  • When the client stops the capture, the API will make sure to collect all of the S3 objects related to this capture, download them and stream them back to the client as one pcap file URL that the client can download.
  • The API is also responsible for security: To make sure the client only downloads their captures, not yours.

You could say all of that could also be done on the CLI, true. But the security shouldn't sit on the client side imo. A lot of server-side logic would be required on the CLI to make this work.

@plowin
Copy link
Contributor

plowin commented Jul 19, 2023

Hi all, a verbal explanation and demo of the current version was just advertised by @ramiyengar, feel free to listen in tomorrow!

Link to stream: https://www.cloudfoundry.org/event/tcpdump-for-everyone/
Time: 4 pm CEST/3 pm London/10 am Eastern
Date: 20th July 2023

@rkoster
Copy link
Contributor

rkoster commented Jul 26, 2023

If ssh is not feasible, would it be possible to reuse some of the existing blobstore infrastructure to store the package captures?

Bosh cli (start package capture) → director API → Nats message (capture filter + signed blobstore destination URL) → bosh agent → embedded pcap lib send capture to blobstore + nats message heartbeat with capture stats → back to director → bosh cli.

This architecture is a lot more involved but is in line with existing communications paths. The pattern for nats message with signed blobstore URL is for example also used when compiling packages.

@domdom82
Copy link
Contributor Author

@rkoster we discussed the ssh option internally and agreed that it could be done, though it would be lacking some of the "higher level" features that pcap-agent could do. the ssh solution would basically be a thin wrapper around sudo tcpdump -w - which may be sufficient in 80% of the cases. we agreed that we should probably make a POC for that approach in the context of this RFC, based on your suggestions about log streaming.

The upside I see here:

  • Ease of use (no additional infrastructure needed, can be integrated directly into BOSH cli without other components)
  • BOSH director won't be involved and won't take the burden of forwarding pcap traffic.
  • No agents, addons, apis, etc.

The downside:

  • The implementation will be BOSH-only, so the code in BOSH cli will have no touch points with pcap-release for CF.
  • Only basic tcpdump captures are possible. No higher-order stuff like "capture traffic between deployment A and B" or "capture only the first 10 packets of each conversation" (we might be able to do something like that if we switch to tshark instead of tcpdump, though).
  • Bandwidth limited to whatever the CLI machine can take, e.g. not usable to capture lots of traffic.

I'd say if the community can live with these shortcomings, the ssh approach may be more "fitting" with the BOSH ecosystem.

@maxmoehl
Copy link
Member

@rkoster @domdom82 when this was initially raised I wanted to know how much effort this would be, turns out the effort is fairly minimal as you can see in this gist. Sure we still need to find the right VMs and do the merging but we've implemented those things one way or another already.

@beyhan
Copy link
Member

beyhan commented Jul 28, 2023

Going over all the discussions and the summary provided by @domdom82 my personal favorite is the ssh approach because:

  • we can address basic needs for tcpdump with a simple solution
  • we can collect feedback and based on that decide for further steps
  • working with bosh for some time now I personally found my way to debug network issues with the available BOSH commands but I can see use cases where a tcpdump could be helpful
  • I think that the CF app developer use case for tcpdump and the BOSH operator use case for tcpdumps are different.

In this comment @domdom82 raised some security concerns. I don't think that they are relevant for the bosh ssh approach. @domdom82 is that correct?

@rkoster
Copy link
Contributor

rkoster commented Aug 1, 2023

@maxmoehl could you update to RFC to reflect the bosh ssh based direction? If you also agree with this direction (I'm assuming so based on the fact that you added a 👍 to Beyhan's comment above).

@maxmoehl
Copy link
Member

maxmoehl commented Aug 1, 2023

@maxmoehl could you update to RFC to reflect the bosh ssh based direction? If you also agree with this direction (I'm assuming so based on the fact that you added a 👍 to Beyhan's comment above).

We are currently working on an updated version of the RFC and will share that soon.

@peanball
Copy link
Contributor

peanball commented Aug 1, 2023

@rkoster, we have an update almost ready. It's being reviewed internally. Should be able to post it soon.

@peanball
Copy link
Contributor

peanball commented Aug 2, 2023

@rkoster it's addressed now in the latest state.

The RFC text changed significantly in terms of structuring, so please have another look. Thanks!

@beyhan
Copy link
Member

beyhan commented Aug 2, 2023

@peanball thanks for the update! I read the latest version and I don't have any requests for changes. I added this RFC as a discussion topic to our meeting notes for the next FI WG meeting this Thursday 3th of August. If any of you would like to participate in the discussions please join our FI WG meeting. The details are available in the CF community calendar.

@domdom82
Copy link
Contributor Author

domdom82 commented Aug 2, 2023

In this comment @domdom82 raised some security concerns. I don't think that they are relevant for the bosh ssh approach. @domdom82 is that correct?

@beyhan yes, that's correct. In the "streaming only" scenario there is no security concern, because the pcaps are not stored elsewhere. That would only occur if the pcap was first buffered on an external datastore.

@peanball
Copy link
Contributor

peanball commented Aug 4, 2023

As discussed in the Foundational Infrastructure Working Group meeting yesterday, I've added the note that Option 2 "pcap-lite" is the selected option. The description of the other option and the comparison of the two are retained as is for reference and historical context.

@domdom82
Copy link
Contributor Author

domdom82 commented Aug 8, 2023

@beyhan Motion to trigger Final Comment Period (FCP) for this RFC. I think we have reached the end of discussions for now.

@beyhan
Copy link
Member

beyhan commented Aug 8, 2023

FCP should end on 15.08.2023

@ameowlia ameowlia merged commit 78f94ce into cloudfoundry:main Aug 15, 2023
@peanball
Copy link
Contributor

Hi all, we created an initial implementation of pcap-lite with cloudfoundry/bosh-cli#627. It's still in draft state.

Preliminary feedback, especially from the BOSH and bosh-cli perspective is welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc CFF community RFC toc
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

10 participants