Vector logs an error on packets that are too large #13175

Closed

cholcombe973 opened this issue Jun 15, 2022 · 4 comments
Labels

  • sink: datadog_metrics (Anything `datadog_metrics` sink related)
  • type: bug (A code related bug)

Comments

@cholcombe973

cholcombe973 commented Jun 15, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When processing entries, Vector encounters input that is larger than it can handle:

 ERROR sink{component_kind="sink" component_id=metrics_to_datadog component_type=datadog_metrics component_name=metrics_to_datadog}: vector::internal_events::datadog_metrics: Failed to encode Datadog metrics. error=A split payload was still too big to encode/compress with…

It appears that the maximum payload size is limited.
When Vector encounters packets that would need to be split, it logs an error instead.

Configuration

# Vector's API for introspection
[api]
  enabled = true
  address = "127.0.0.1:8686"

# Vector's own internal metrics
[sources.internal_logs]
  type = "internal_logs"

[sources.internal_metrics]
type = "internal_metrics"
scrape_interval_secs = 2

[sources.datadog_agents]
type = "datadog_agent"
address = "[::]:8564"
multiple_outputs = true

[transforms.tag_logs]
type = "remap"
inputs = [ "datadog_agents.logs" ]
source = """
# Parse the received .ddtags field so we can more easily access the contained tags, set to empty object if parsing fails
.ddtags = parse_key_value(.ddtags, key_value_delimiter: ":", field_delimiter: ",") ?? {}
.ddtags.sender = "vector"
# Re-encode Datadog tags as a string for the `datadog_logs` sink
.ddtags = encode_key_value(.ddtags, key_value_delimiter: ":", field_delimiter: ",")
"""

[transforms.tag_metrics]
type = "remap"
inputs = [ "datadog_agents.metrics" ]
source = """
.tags.sender = "vector"
"""

[sinks.log_to_datadog]
type = "datadog_logs"
inputs = [ "internal_logs", "tag_logs" ]
default_api_key = "REDACTED"

[sinks.metrics_to_datadog]
type = "datadog_metrics"
inputs = [ "internal_metrics", "tag_metrics" ]
default_api_key = "REDACTED"

Version

vector-0.22.0-1

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@cholcombe973 added the `type: bug` label on Jun 15, 2022
@tobz
Contributor

tobz commented Jun 16, 2022

So, this error is actually emitted when the sink tries to split an input batch but fails to do so successfully, i.e. the new requests built from the split inputs are still too big.

It's definitely a little weird that it would fail after splitting since the limit is 3MB/62MB compressed/uncompressed... which can typically hold a lot of metrics.

What do the component_discarded_events_total and component_errors_total counters look like for you when this happens? Specifically for the datadog_metrics sink. Knowing how many events are discarded, divided by the number of errors, would give us an indication of how large the split-but-still-too-big requests are, and whether we're dealing with some sort of pathologically-sized distribution, or just a lot of series, or what.

@cholcombe973
Author

Looks like, on Datadog's graphs for the past hour, this has averaged 52K discarded_events_total and 10.95 component_errors_total.
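
(Taking those figures at face value, that is roughly 52,000 / 10.95 ≈ 4,750 discarded events per error, i.e. per failed split request.)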

@jszwedko added the `sink: datadog_metrics` label on Jun 16, 2022
@bruceg self-assigned this on Jun 23, 2022
@Sg-23

Sg-23 commented Dec 2, 2022

I am also facing a similar issue while sending metrics from datadog-agent -> vector-agent -> aggregator -> datadog.
This works fine for a few clusters, but as soon as traffic is deployed to all the clusters, metrics start dropping intermittently.
Below are the errors I am able to see at the different steps:

In datadog-agent:
2022-12-01 13:16:17 UTC | CORE | ERROR | (pkg/forwarder/worker.go:180 in process) | Too many errors for endpoint 'http://10.216.176.236:8282/api/beta/sketches': retrying later

In vector-agent:
2022-12-01T07:12:22.691704Z WARN sink{component_kind="sink" component_id=metric-ingest component_type=vector component_name=metric-ingest}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.

In aggregator:
2022-12-01T06:31:48.293477Z ERROR sink{component_kind="sink" component_id=eg_datadog component_type=datadog_metrics component_name=eg_datadog}: vector::internal_events::datadog_metrics: Failed to encode Datadog metrics. error=A split payload was still too big to encode/compress within size limits. error_code=split_failed error_type="encoder_failed" stage="processing"

Datadog agent version: 7.40.1-jmx
Datadog cluster-agent version: 7.40.1
Vector version: 0.24.1-debian

@jszwedko unassigned tobz and bruceg on Jan 11, 2023
github-merge-queue bot pushed a commit that referenced this issue Jun 30, 2023
## Context

When support was added for encoding/sending sketches in #9178, logic was
added to handle "splitting" payloads if a metric exceeded the
(un)compressed payload limits. As we lacked (at the time) the ability to
encode sketch metrics one-by-one, we were forced to collect all of them,
and then attempt to encode them all at once, which had a tendency to
grow the response size past the (un)compressed payload limits. This
"splitting" mechanism allowed us to compensate for that.

However, in order to avoid getting stuck in pathological loops where
payloads were too big, and thus required multiple splits (after already
attempting at least one split), the logic was configured such that a
batch of metrics would only be split once, and if the two subsequent
slices couldn't be encoded without also exceeding the limits, they would
be dropped and we would give up trying to split further.

Despite the gut feeling during that work that it should be exceedingly
rare to ever need to split further, real life has shown otherwise:
#13175

## Solution

This PR introduces proper incremental encoding of sketches, which
doesn't eliminate the possibility of needing to split (more below) but
significantly reduces the likelihood that splitting will need to happen
down to a purely theoretical level.

We're taking advantage of hidden-from-docs methods in `prost` to encode
each `SketchPayload` object and append the bytes into a single buffer.
This is possible due to how Protocol Buffers functions. Additionally,
we're now generating "file descriptors" for our compiled Protocol
Buffers definitions. We use this to let us programmatically query the
field number of the "sketches" field in the `SketchPayload` message,
which is a slightly more robust way than just hardcoding it and hoping
it doesn't ever change in the future.

In Protocol Buffers, each field in a message is written out such that
the field data is preceded by the field number. This is part and parcel
of its ability to allow for backwards-compatible changes to a
definition. Further, for repeated fields -- i.e. `Vec<Sketch>` -- the
repetition is expressed simply by writing the same field multiple
times rather than needing to write everything all together. Practically
speaking, this means that we can encode a vector of two messages, or
encode those two messages individually, and end up with the same encoded
output of `[field N][field data][field N][field data]`.
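
To make that concrete, here is a minimal sketch using `prost`'s derive macro (it assumes the `prost` crate with its default `derive` feature; the `Item`/`Batch` types are illustrative stand-ins, not Vector's actual `Sketch`/`SketchPayload` definitions). Encoding a repeated field one element at a time and concatenating the bytes yields the same output as encoding everything at once:

```rust
use prost::Message;

#[derive(Clone, PartialEq, Message)]
struct Item {
    #[prost(string, tag = "1")]
    name: String,
}

#[derive(Clone, PartialEq, Message)]
struct Batch {
    // Repeated message field, analogous to the repeated `sketches` field.
    #[prost(message, repeated, tag = "1")]
    items: Vec<Item>,
}

fn main() {
    let a = Item { name: "a".into() };
    let b = Item { name: "b".into() };

    // Encode a batch containing both items in one shot.
    let both = Batch { items: vec![a.clone(), b.clone()] };
    let mut all_at_once = Vec::new();
    both.encode(&mut all_at_once).unwrap();

    // Encode two single-item batches and append the bytes into one buffer.
    let first = Batch { items: vec![a] };
    let second = Batch { items: vec![b] };
    let mut incremental = Vec::new();
    first.encode(&mut incremental).unwrap();
    second.encode(&mut incremental).unwrap();

    // Both buffers contain `[field 1][item a][field 1][item b]`.
    assert_eq!(all_at_once, incremental);
}
```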

### Ancillary changes

We've additionally fixed a bug with the "bytes sent" metric being
reported for the `datadog_metrics` sink due to some very tangled and
miswired code around how compressed/uncompressed/event bytes/etc sizes
were being shuttled from the request builder logic down to `Driver`.

We've also reworked some of the encoder error types just to clean them
up and simplify things a bit.

## Reviewer notes

### Still needing to handle splits

The encoder still does need to care about splits, in a theoretical
sense, because while we can accurately track and avoid ever exceeding
the uncompressed payload limit, we can't know the final compressed
payload size until we finalize the builder/payload.

Currently, the encoder does a check to see if adding the current metric
would cause us to exceed the compressed payload limit, assuming the
compressor couldn't actually compress the encoded metric at all. This is
a fairly robust check since it tries to optimally account for the
overhead of an entirely incompressible payload, and so on... but we
really want to avoid dropping events if possible, obviously, and that's
why the splitting code is still in place.
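
As a rough illustration of that conservative check, here is a simplified sketch (the names are hypothetical and the 3MB figure is only illustrative, taken from the compressed limit mentioned earlier in the thread; this is not Vector's actual encoder code):

```rust
// Illustrative compressed payload cap, based on the ~3MB limit noted above.
const COMPRESSED_LIMIT_BYTES: usize = 3_000_000;

struct PayloadState {
    /// Bytes the compressor has produced so far for this payload.
    compressed_len: usize,
}

impl PayloadState {
    /// Conservatively decide whether another `encoded_len` bytes of metric
    /// data can be added: assume zero compression for the new data, so the
    /// worst case is that it lands in the compressed output verbatim.
    fn can_fit(&self, encoded_len: usize) -> bool {
        self.compressed_len + encoded_len <= COMPRESSED_LIMIT_BYTES
    }
}
```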
@jszwedko
Member

Closed by #17764
