Capture partial frames for syscalls which exceed bpf LOOP/CHUNK limits #1755
Great write-up @benkilimnik. Here are my initial thoughts.

Option 1: We should definitely do this. It is the main way to capture more data. We had avoided this in the past because we didn't want to create different behavior on different kernels (which becomes a headache when users report issues, a headache for testing, etc.). However, I think we're at the point where we should cross this bridge. We should stay principled and keep the kernel-level differences down to a few key parameters.

Option 2: This is a good idea, but will be complex. Note that in the case where we drop data, we still track positions correctly for the DataStreamBuffer. That's because we take the total bytes from the syscall itself (see https://github.com/pixie-io/pixie/blob/b6cf0ac3e4723a409b41cc0b603f3b8ddc1001b3/src/stirling/source_connectors/socket_tracer/bcc_bpf/socket_trace.c#L819C5-L819C22). So we currently do create gaps where these losses occur. The bulk of this work, in my mind, becomes letting the parsers know where missing data is located, and letting them try to parse regardless. It means we need a new data format (where gaps can be encoded), and an augmentation of all of the parsers (which we could of course do incrementally). So I think Option 2 is a worthy goal, but is complex.

Option 3: We've looked into this before. I'm not sure this works that easily, because tail calls can't return, and there is processing that we do after the loop is complete. There would be a lot of state that would have to be passed from one call to another so that it could complete the work. It won't be pretty.

In short, I would definitely do Option 1. It's easier than Option 3 for capturing more data, and the older kernels are starting to become less prevalent, making it a good option. Then I would follow that up with Option 2, treading very carefully.
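For illustration, the "new data format (where gaps can be encoded)" mentioned above could look something like the following sketch. All names and fields here are hypothetical, not Pixie's actual DataStreamBuffer API; the point is just that tagging each region of the stream as real bytes or a hole lets a gap-aware parser stop cleanly instead of failing on missing data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical gap-aware stream region: either `size` real bytes at `pos`,
// or a hole of `size` bytes whose contents were never captured. The real
// DataStreamBuffer already tracks positions; this only adds the gap flag.
typedef struct {
  size_t pos;   // byte offset within the stream
  size_t size;  // number of bytes this region covers
  bool is_gap;  // true if the bytes were lost (e.g. loop limit exceeded)
} StreamRegion;

// A gap-aware parser could walk the regions and stop at the first hole
// rather than misparsing. Returns the number of bytes parseable before
// the first gap.
static size_t parseable_prefix(const StreamRegion* regions, size_t n) {
  size_t bytes = 0;
  for (size_t i = 0; i < n; ++i) {
    if (regions[i].is_gap) break;
    bytes += regions[i].size;
  }
  return bytes;
}
```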
Thanks for the comments @oazizi000. I've looked into this further.

Option 1: Raising chunk/loop limits

I have confirmed that the verifier would allow us to raise the chunk/loop limits on newer kernels.

20min perf test (default loop limit)
| incomplete_reason | lazy_parsing_enabled | protocol | total_bytes_lost_in_gaps |
|---|---|---|---|
| kExceededLoopLimit | false | kProtocolHTTP | ~17.8 MiB lost |
| kFiller | false | kProtocolHTTP | ~17.8 MiB in filler allocations |
20min perf test for px-k8ssandra with 21x higher loop limit
With a higher loop limit, we lose significantly less HTTP data (1.2 MiB instead of 17.8 MiB).

| incomplete_reason | lazy_parsing_enabled | protocol | total_bytes_lost_in_gaps |
|---|---|---|---|
| kExceededLoopLimit | false | kProtocolHTTP | ~1.2 MiB |
| kFiller | false | kProtocolHTTP | ~1.2 MiB |
Specific perf differences are hard to determine from the aggregate graphs, but this at least rules out a major degradation.
Metrics for gap size and bytes discarded before gap
I have added Prometheus metrics to keep track of the number of bytes we miss (the size of the gap) in the socket tracer for several gap conditions. This metric shows us the prevalence of gaps across protocols, as well as the ingest impact of raising chunk/loop limits. Additionally, I am keeping track of the number of bytes rendered unparseable due to the presence of a gap (i.e. the bytes we discard because we can't process the incomplete chunk).
With the Prometheus metrics deployed in production, we should see how frequently these cases arise in the wild and how much more data ingest we could gain by implementing lazy parsing on a per-protocol basis, or by raising the loop/chunk limits.
Gap Conditions
I have identified several conditions under which we'd see a gap in the data stream buffer due to the behavior of our BPF code.
process_syscall_sendfile
Commit c61cbbe added support for tracing sendfile calls used by some servers to write data to a socket. We can't trace the data itself, so we instead fill the missing data with null bytes up to 1 MB. If the data is larger than 1 MB, we create a gap in the data stream buffer (the associated incomplete chunk is most probably dropped in the event parser, i.e. ParseFramesLoop returns kInvalid once the parser reaches the gap, dropping the contiguous head).
Note that the metadata (e.g. HTTP headers) preceding the file will have been sent and captured before the sendfile syscall and should therefore have been placed before the file in the data stream buffer.
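As a rough sketch of the filler policy described above (the constant and function names here are illustrative stand-ins, not the real BPF/ExtractFillerEvent code):

```c
#include <assert.h>
#include <stddef.h>

// Illustrative stand-in for the ~1 MB fill cap (kMaxFilledSizeBytes).
#define MAX_FILLED_SIZE_BYTES (1024 * 1024)

typedef struct {
  size_t filler_bytes;  // null bytes synthesized into the stream
  size_t gap_bytes;     // remainder left as a hole in the stream
} FillResult;

// For an untraceable sendfile payload of `file_size` bytes, synthesize
// null bytes up to the cap so stream positions stay consistent; anything
// beyond the cap becomes a genuine gap in the data stream buffer.
static FillResult fill_sendfile(size_t file_size) {
  FillResult r;
  r.filler_bytes =
      file_size < MAX_FILLED_SIZE_BYTES ? file_size : MAX_FILLED_SIZE_BYTES;
  r.gap_bytes = file_size - r.filler_bytes;
  return r;
}
```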
perf_submit_wrapper
On the last iteration of the chunk loop (which chops large messages into CHUNK_LIMIT - 1 parts), we submit all of the remaining bytes of the msg, truncating if it is larger than MAX_MSG_SIZE (32 KiB). ExtractFillerEvent then attempts to fill the truncated data up to kMaxFilledSizeBytes (1 MB). If it is larger than this, we end up with a gap.
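The truncation math can be sketched as follows. CHUNK_LIMIT here is a made-up value (the real one lives in the BPF code); MAX_MSG_SIZE mirrors the 32 KiB figure above. The key point is that capture is bounded by CHUNK_LIMIT * MAX_MSG_SIZE, and everything past that bound must be backfilled with filler or becomes a gap.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_LIMIT 4             // illustrative; not the real value
#define MAX_MSG_SIZE (32 * 1024)  // 32 KiB, per the discussion above

// Each of the first CHUNK_LIMIT-1 loop iterations submits up to
// MAX_MSG_SIZE bytes; the last iteration submits the remainder, truncated
// at MAX_MSG_SIZE. So the most we can ever capture for one message is
// CHUNK_LIMIT * MAX_MSG_SIZE bytes.
static size_t bytes_captured(size_t msg_size) {
  const size_t max_capture = (size_t)CHUNK_LIMIT * MAX_MSG_SIZE;
  return msg_size < max_capture ? msg_size : max_capture;
}

// Bytes beyond the capture bound: filled (up to the filler cap) or lost.
static size_t bytes_truncated(size_t msg_size) {
  return msg_size - bytes_captured(msg_size);
}
```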
perf_submit_iovecs
There are two cases I've identified in which we end up with an incomplete chunk in perf_submit_iovecs. The first is when we exceed the LOOP_LIMIT (i.e. the msg is chopped up into more iovecs than we can process), and the second is when the iov_size exceeds the maximum filler event size (i.e. the msg is divided into individually large iovecs). I have confirmed that both of these conditions actually occur in practice in the CQL and NodeJS bpf tests (more details in the BPF tests section below).
Note that we don't know anything about the number or size of frames contained within these bytes - we could have one very big frame stretched across multiple iovecs, or many smaller frames in a single iovec. Either way, having gaps (with or without filler bytes) will most likely cause parsing to fail.
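The two iovec gap conditions can be sketched together. The constants and the function below are illustrative (LOOP_LIMIT matches the 42 discussed in this issue; the fill cap stands in for the maximum filler event size), not the actual BPF code:

```c
#include <assert.h>
#include <stddef.h>

#define LOOP_LIMIT 42              // per the discussion above
#define MAX_FILL (1024 * 1024)     // illustrative filler-size cap

// Sketch of the two gap conditions in perf_submit_iovecs:
//   (1) iovcnt exceeds LOOP_LIMIT, so trailing iovecs are never visited;
//   (2) a single iovec is larger than the maximum filler event, so its
//       tail cannot be backfilled.
// Returns the total gap bytes for a message described by iov_lens.
static size_t iovec_gap_bytes(const size_t* iov_lens, size_t iovcnt) {
  size_t gap = 0;
  for (size_t i = 0; i < iovcnt; ++i) {
    if (i >= LOOP_LIMIT) {
      gap += iov_lens[i];             // condition 1: loop limit exceeded
    } else if (iov_lens[i] > MAX_FILL) {
      gap += iov_lens[i] - MAX_FILL;  // condition 2: iovec too large to fill
    }
  }
  return gap;
}
```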
Frequency of these gap conditions
I have verified that the following gap conditions are triggered in our demo apps / bpf tests:
BPF tests:
- cql_trace_bpf_test: iovExceedsMaxMsgSize
- nodejs_trace_bpf_test (new bpf test adapted from a community-submitted app, see [Parse incomplete chunks 9/9] Add bpf test for loop limit exceeded #1794): iovExceedsLooplimit, where iov_len = 257
Demos
px-k8ssandra (run for 1 hour):
Turning off filler and parsing lazily (as far as possible up to the gap) could save us 56 MB of null byte allocations per hour.
| incomplete_reason | lazy_parsing_enabled | protocol | total_bytes_lost_in_gaps |
|---|---|---|---|
| kExceededLoopLimit | false | kProtocolHTTP | ~53.3 MiB lost |
| kFiller | false | kProtocolHTTP | ~53.3 MiB in filler allocations |
Note that in this particular case the traffic doesn't seem to reach our parser (it gets purged from the data stream buffer due to a connection resolution error, printed below), which is why the unparseable_bytes_before_gap metric is 0.
W20231120 18:51:18.174887 1391373 conn_tracker.cc:421] Resetting DataStream buffers. Disabling connection dest=10.98.4.126:9966 reason=No client-side tracing: Remote endpoint is inside the cluster.
W20231120 18:51:18.175134 1391373 conn_tracker.cc:421] Resetting DataStream buffers. Disabling connection dest=-:-1 reason=Could not determine role for traffic because connection resolution failed.
| incomplete_reason | lazy_parsing_enabled | protocol | unparseable_bytes_before_gap |
|---|---|---|---|
| kFullyFormed | false | kProtocolHTTP | 0.0 |
px-kafka (run for 30min)
The kafka demo features a significant number of sendfile events, which we currently pad with filler. Turning off filler events and parsing lazily could save us ~175MB every 30min.
| incomplete_reason | lazy_parsing_enabled | protocol | total_bytes_lost_in_gaps |
|---|---|---|---|
| kHeaderEvent | false | kProtocolKafka | 0.0 |
| kFiller | false | kProtocolKafka | ~166.4 MiB in filler allocations |
| kSendFile | false | kProtocolKafka | ~166.4 MiB lost |
Note that sendfile events contain no data (the metadata is transferred previously), so our unparseable_bytes_before_gap metric is zero as well. With lazy parsing, we could do away with filler events and just process the metadata.
Option 2: lazy parsing
Instead of relying on filler events with large null byte allocations, we could modify the protocol parsers to run lazily, parsing as far as possible. Lazy parsing would let us capture partial frames if we know from BPF that a gap is coming. To avoid masking unrelated errors, we should push a partial frame only if we know there is a gap coming. In the sendfile case, we only know at the start of the file (after the metadata is captured) that a gap is coming up, so a lazy parser would need to return a ParseState akin to kMetadataComplete, where metadata is the minimum amount of information we need to successfully stitch a frame.
I have put together a possible lazy implementation of the HTTP parser on #1789. Any feedback is much appreciated.
Originally, I considered only pushing a partial frame if the protocol parser got past the incomplete event start. This turned out to be difficult to implement, as events can be very small. For instance, a community-submitted sample NodeJS application has an iovec with 257 elements, each of which contains only ~16 bytes. With a loop limit of 42, we merge 41 events into the data stream buffer, getting us to 878 bytes. Then, on the 42nd iov, we reach the loop limit, so our incomplete event begins at the 878th byte. When the protocol parser receives the head with this incomplete event, it is able to parse up to the 157th byte (kMetadataComplete). If we only pushed a partial frame after the incomplete event start (i.e. after position 878), we would miss the partial frame at the start. So instead, the lazy parser I built treats a kMetadataComplete in a contiguous head with an incomplete chunk as a valid partial frame. It also adds a DCHECK to ensure that a contiguous head that ends in a gap contains at most one partial frame.
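To make the decision rule concrete, here is a hedged sketch of it. The enum values echo Pixie's ParseState naming, but the function and its signature are invented for illustration: a kMetadataComplete result counts as a partial frame only when the contiguous head is known to end in a gap, so the partial state cannot mask unrelated parse errors.

```c
#include <assert.h>
#include <stdbool.h>

// Echoes Pixie's ParseState values; the exact set in the codebase differs.
typedef enum {
  kInvalid,
  kNeedsMoreData,
  kMetadataComplete,  // enough bytes parsed to stitch a partial frame
  kSuccess,
} ParseState;

// Decision sketch: push a partial frame only when the parser reached
// kMetadataComplete AND the contiguous head ends in a known gap. In all
// other cases the result is treated as an ordinary incomplete or failed
// parse, preserving existing error behavior.
static bool push_partial_frame(ParseState s, bool head_ends_in_gap) {
  return s == kMetadataComplete && head_ends_in_gap;
}
```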
Related PRs
#1786, #1787, #1788, #1789, #1790, #1791, #1792, #1793, #1794, #1795
Summary: Dynamically increase the loop limit for newer kernels with higher instruction limits (1 million instructions for kernels > 5.1) by 21x to reduce data loss and raise ingest. More details in #1755. One open question is whether we want to add a Vizier flag to toggle this behavior in case there are unforeseen performance bottlenecks for certain clusters.

Type of change: /kind feature

Test Plan: Existing targets + perf/demo tests outlined in #1755.

Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
Problem Description
Pixie is unable to capture syscalls with an iovcnt > 42 or a message size > 120 KiB. These variables are set conservatively to keep the instruction count below BPF's limit for version 4 kernels (4096 per probe). These limits, however, result in data loss and incomplete syscall tracing. For example, in a community-shared NodeJS application transferring just 10 kB of data, the iovec array contained 257 entries, well beyond the current LOOP_LIMIT of 42. We've also seen the message size limit (CHUNK_LIMIT) exceeded in k8ssandra deployments. This is very likely an issue across all protocols.

Proposed Solutions
(e.g. FLAGS_max_body_bytes for HTTP). However, capturing complete metadata could still be invaluable, conveying headers, response codes, and other important information.

When syscalls exceed the LOOP/CHUNK limits, pass metadata to the event parser, which attempts to process a partial frame. For this to work, protocol parsers must be modified to work lazily, parsing as far as possible and returning a new ParseState, kPartialFrame, when they've processed enough bytes to capture essential metadata.

When the LOOP/CHUNK limit is reached, the event parser will eventually receive a contiguous head of the data stream buffer that ends with a gap representing the bytes we missed. Note that there could be any number of valid frames before the gap, because Pixie's sampling frequency is greater than its push frequency ("sampling" is used loosely here, as Pixie receives every event and not a subset of them). Moreover, the application itself could be batching messages, such that an incomplete chunk could contain a number of valid frames before the gap.

In BPF, we can determine the full message size and keep track of how many bytes were missed if the LOOP/CHUNK limit is reached. We can pass this information through the event to the data stream buffer, so that the event parser knows when to expect an incomplete chunk. A DCHECK would enforce that, for a given call to ParseFramesLoop with a contiguous head, a partial frame is pushed at most once, since we expect to reach the gap only once.

To avoid potential side effects from using the kPartialFrame state (i.e. to prevent it from masking other errors), we could use a heuristic to determine whether a partial frame was caused by a lack of bytes. We could store the max size of the fields that we could possibly parse in the metadata of a frame. If this is greater than the number of bytes remaining, then we hit our gap, so it makes sense to push a partial frame. If, however, we have sufficient bytes remaining to parse these fields, then a different error likely occurred and we don't want to push the partial frame.
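A minimal sketch of that heuristic, with invented names (max_parseable_field_size would come from the frame's metadata; the function is illustrative, not actual Pixie code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Heuristic from the discussion above: if the largest field we might still
// need to parse is bigger than the bytes remaining in the buffer, the parse
// most likely stopped because it ran into the gap, so pushing a partial
// frame is safe. If enough bytes remain, a different parse error likely
// occurred, and we should not mask it with a partial frame.
static bool hit_gap(size_t max_parseable_field_size, size_t bytes_remaining) {
  return max_parseable_field_size > bytes_remaining;
}
```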