Vector logs an error on packets that are too large #13175

Closed

cholcombe973 opened this issue Jun 15, 2022 · 4 comments
Labels

  • sink: datadog_metrics (Anything `datadog_metrics` sink related)
  • type: bug (A code related bug)

Comments

@cholcombe973

cholcombe973 commented Jun 15, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When processing entries, Vector encounters input that is larger than it can handle:

 ERROR sink{component_kind="sink" component_id=metrics_to_datadog component_type=datadog_metrics component_name=metrics_to_datadog}: vector::internal_events::datadog_metrics: Failed to encode Datadog metrics. error=A split payload was still too big to encode/compress with…

It appears that the maximum payload size is limited.
When Vector encounters packets that would need to be split, it logs an error instead.

Configuration

# Vector's API for introspection
[api]
  enabled = true
  address = "127.0.0.1:8686"

# Vector's own internal metrics
[sources.internal_logs]
  type = "internal_logs"

[sources.internal_metrics]
type = "internal_metrics"
scrape_interval_secs = 2

[sources.datadog_agents]
type = "datadog_agent"
address = "[::]:8564"
multiple_outputs = true

[transforms.tag_logs]
type = "remap"
inputs = [ "datadog_agents.logs" ]
source = """
# Parse the received .ddtags field so we can more easily access the contained tags, set to empty object if parsing fails
.ddtags = parse_key_value(.ddtags, key_value_delimiter: ":", field_delimiter: ",") ?? {}
.ddtags.sender = "vector"
# Re-encode Datadog tags as a string for the `datadog_logs` sink
.ddtags = encode_key_value(.ddtags, key_value_delimiter: ":", field_delimiter: ",")
"""

[transforms.tag_metrics]
type = "remap"
inputs = [ "datadog_agents.metrics" ]
source = """
.tags.sender = "vector"
"""

[sinks.log_to_datadog]
type = "datadog_logs"
inputs = [ "internal_logs", "tag_logs" ]
default_api_key = "REDACTED"

[sinks.metrics_to_datadog]
type = "datadog_metrics"
inputs = [ "internal_metrics", "tag_metrics" ]
default_api_key = "REDACTED"

Version

vector-0.22.0-1

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@cholcombe973 added the `type: bug` label on Jun 15, 2022
@tobz
Contributor

tobz commented Jun 16, 2022

So, this error is actually emitted when the sink tries to split an input batch but fails to do so successfully, i.e. the new requests built from the split inputs are still too big.

It's definitely a little weird that it would fail after splitting since the limit is 3MB/62MB compressed/uncompressed... which can typically hold a lot of metrics.

What do the component_discarded_events_total and component_errors_total counters look like for you when this happens? Specifically for the datadog_metrics sink. Knowing how many events are discarded, divided by the number of errors, would give us an indication of how large the split-but-still-too-big requests are, and whether we're dealing with some sort of pathologically-sized distribution, or just a lot of series, or what.

@cholcombe973
Author

Looks like, on Datadog's graphs for the past hour, this has averaged 52K discarded_events_total and 10.95 component_errors_total.
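
(Taking those figures at face value, that is roughly 52,000 / 10.95 ≈ 4,750 discarded events per error, i.e. per failed split request.)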

@jszwedko added the `sink: datadog_metrics` label on Jun 16, 2022
@bruceg self-assigned this on Jun 23, 2022
@Sg-23

Sg-23 commented Dec 2, 2022

I am also facing a similar issue while sending metrics from datadog-agent -> vector-agent -> aggregator -> datadog.
This works fine for a few clusters, but as soon as traffic is deployed to all the clusters, metrics start dropping intermittently.
Below are the errors I am able to see at the different steps:

In datadog-agent:
2022-12-01 13:16:17 UTC | CORE | ERROR | (pkg/forwarder/worker.go:180 in process) | Too many errors for endpoint 'http://10.216.176.236:8282/api/beta/sketches': retrying later

In vector-agent:
2022-12-01T07:12:22.691704Z WARN sink{component_kind="sink" component_id=metric-ingest component_type=vector component_name=metric-ingest}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.

In aggregator:
2022-12-01T06:31:48.293477Z ERROR sink{component_kind="sink" component_id=eg_datadog component_type=datadog_metrics component_name=eg_datadog}: vector::internal_events::datadog_metrics: Failed to encode Datadog metrics. error=A split payload was still too big to encode/compress within size limits. error_code=split_failed error_type="encoder_failed" stage="processing"

Datadog agent version: 7.40.1-jmx
Datadog cluster-agent version: 7.40.1
Vector version: 0.24.1-debian

@jszwedko unassigned tobz and bruceg on Jan 11, 2023
github-merge-queue bot pushed a commit that referenced this issue Jun 30, 2023
## Context

When support was added for encoding/sending sketches in #9178, logic was
added to handle "splitting" payloads if a metric exceeded the
(un)compressed payload limits. As we lacked (at the time) the ability to
encode sketch metrics one-by-one, we were forced to collect all of them,
and then attempt to encode them all at once, which had a tendency to
grow the response size past the (un)compressed payload limits. This
"splitting" mechanism allowed us to compensate for that.

However, in order to avoid getting stuck in pathological loops where
payloads were too big, and thus required multiple splits (after already
attempting at least one split), the logic was configured such that a
batch of metrics would only be split once, and if the two subsequent
slices couldn't be encoded without also exceeding the limits, they would
be dropped and we would give up trying to split further.

Despite the gut feeling during that work that it should be exceedingly
rare to ever need to split further, real life has shown otherwise:
#13175

## Solution

This PR introduces proper incremental encoding of sketches, which
doesn't eliminate the possibility of needing to split (more below) but
significantly reduces the likelihood that splitting will need to happen
down to a purely theoretical level.

We're taking advantage of hidden-from-docs methods in `prost` to encode
each `SketchPayload` object and append the bytes into a single buffer.
This is possible due to how Protocol Buffers functions. Additionally,
we're now generating "file descriptors" for our compiled Protocol
Buffers definitions. We use this to let us programmatically query the
field number of the "sketches" field in the `SketchPayload` message,
which is a slightly more robust way than just hardcoding it and hoping
it doesn't ever change in the future.

In Protocol Buffers, each field in a message is written out such that
the field data is preceded by the field number. This is part and parcel
of its ability to allow for backwards-compatible changes to a
definition. Further, for repeated fields -- i.e. `Vec<Sketch>` -- the
repetition is expressed simply by writing the same field multiple
times rather than needing to write everything all together. Practically
speaking, this means that we can encode a vector of two messages, or
encode those two messages individually, and end up with the same encoded
output of `[field N][field data][field N][field data]`.
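
To make that concrete, here is a minimal sketch using `prost`'s derive macro (it assumes the `prost` crate with its default `derive` feature; the `Item`/`Batch` types are illustrative stand-ins, not Vector's actual `Sketch`/`SketchPayload` definitions). Encoding a repeated field one element at a time and concatenating the bytes yields the same output as encoding everything at once:

```rust
use prost::Message;

#[derive(Clone, PartialEq, Message)]
struct Item {
    #[prost(string, tag = "1")]
    name: String,
}

#[derive(Clone, PartialEq, Message)]
struct Batch {
    // Repeated message field, analogous to the repeated `sketches` field.
    #[prost(message, repeated, tag = "1")]
    items: Vec<Item>,
}

fn main() {
    let a = Item { name: "a".into() };
    let b = Item { name: "b".into() };

    // Encode a batch containing both items in one shot.
    let both = Batch { items: vec![a.clone(), b.clone()] };
    let mut all_at_once = Vec::new();
    both.encode(&mut all_at_once).unwrap();

    // Encode two single-item batches and append the bytes into one buffer.
    let first = Batch { items: vec![a] };
    let second = Batch { items: vec![b] };
    let mut incremental = Vec::new();
    first.encode(&mut incremental).unwrap();
    second.encode(&mut incremental).unwrap();

    // Both buffers contain `[field 1][item a][field 1][item b]`.
    assert_eq!(all_at_once, incremental);
}
```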

### Ancillary changes

We've additionally fixed a bug with the "bytes sent" metric being
reported for the `datadog_metrics` sink due to some very tangled and
miswired code around how compressed/uncompressed/event bytes/etc sizes
were being shuttled from the request builder logic down to `Driver`.

We've also reworked some of the encoder error types just to clean them
up and simplify things a bit.

## Reviewer notes

### Still needing to handle splits

The encoder still does need to care about splits, in a theoretical
sense, because while we can accurately track and avoid ever exceeding
the uncompressed payload limit, we can't know the final compressed
payload size until we finalize the builder/payload.

Currently, the encoder does a check to see if adding the current metric
would cause us to exceed the compressed payload limit, assuming the
compressor couldn't actually compress the encoded metric at all. This is
a fairly robust check since it tries to optimally account for the
overhead of an entirely incompressible payload, and so on... but we
really want to avoid dropping events if possible, obviously, and that's
why the splitting code is still in place.
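
As a rough illustration of that conservative check, here is a simplified sketch (the names are hypothetical and the 3MB figure is only illustrative, taken from the compressed limit mentioned earlier in the thread; this is not Vector's actual encoder code):

```rust
// Illustrative compressed payload cap, based on the ~3MB limit noted above.
const COMPRESSED_LIMIT_BYTES: usize = 3_000_000;

struct PayloadState {
    /// Bytes the compressor has produced so far for this payload.
    compressed_len: usize,
}

impl PayloadState {
    /// Conservatively decide whether another `encoded_len` bytes of metric
    /// data can be added: assume zero compression for the new data, so the
    /// worst case is that it lands in the compressed output verbatim.
    fn can_fit(&self, encoded_len: usize) -> bool {
        self.compressed_len + encoded_len <= COMPRESSED_LIMIT_BYTES
    }
}
```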
@jszwedko
Member

Closed by #17764
