Prometheus remote write serialiser drops buffered metrics #11682
Comments
next steps: review PR
Any updates on this?
As the next steps state, take a look at the discussion on the PR in #11683, specifically this comment.
I followed the other related threads and it seems like some work was done, but we recently confirmed that this is still happening with v1.28.1 when we write to Prometheus directly from Telegraf with batching enabled. With batching disabled, no metrics are dropped, but I'm not sure we'd be happy to stay with that workaround. I also tried the …
AFAIR another workaround is to use a small batch size: we did some tests, and it seems that if the batch size is smaller than the number of collected metrics/series, metrics are not dropped. Still far from ideal, but better than no batching at all.
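A minimal telegraf.conf sketch of that workaround (the value 250 is an assumption; pick something below the number of unique series you collect per interval):

```toml
[agent]
  ## Assumed value: keep the batch size below the per-interval series count,
  ## per the workaround described above.
  metric_batch_size = 250
```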
This bug should still be fixed, but in the meantime we're going to trial influx2cortex and just get our Telegrafs to write using the influxdb output.
Relevant telegraf.conf
Logs from Telegraf
I wrote a sample Python HTTP server that accepts the remote-write protobuf payloads and prints them to stdout (ignore the 1-hour difference in timestamps; Python was logging local time):
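The original Python script is not included above. As a reference point only, a rough sketch of such a receiver in Go (the port, endpoint path, and package choices are assumptions, not what was actually used) could look like this:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// Endpoint path and port are arbitrary choices for this sketch.
	http.HandleFunc("/receive", func(w http.ResponseWriter, r *http.Request) {
		compressed, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Remote-write bodies are snappy-compressed protobuf WriteRequests.
		raw, err := snappy.Decode(nil, compressed)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		var req prompb.WriteRequest
		if err := proto.Unmarshal(raw, &req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Print every sample of every series so dropped samples become visible.
		for _, ts := range req.Timeseries {
			for _, s := range ts.Samples {
				fmt.Printf("%v value=%g ts=%d\n", ts.Labels, s.Value, s.Timestamp)
			}
		}
	})
	log.Fatal(http.ListenAndServe(":9201", nil))
}
```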
System info
All platforms, all versions of telegraf with prometheusremotewrite serialiser
Docker
No response
Steps to reproduce
Note that this may be difficult to observe in a healthy production scenario. However, it is very apparent when you experience a remote-write endpoint outage that lasts long enough for Cortex/Mimir to show large gaps in your time series. When the remote-write endpoint is eventually restored, Telegraf appears to write all metrics in the buffer out to the endpoint, but when you look at the time series data you will observe large gaps.
Expected behavior
The prometheus remote write serialiser should collate all buffered metric samples for sending to the remote write endpoint. This is in contrast to the prometheus client output plugin, which only needs to expose the latest sample of each metric for scraping.
The primary scenario where this matters is when the remote write endpoint is unreachable. Telegraf will buffer the metrics so that when the remote write endpoint is reachable again, all previous metrics in the buffer are sent to the remote write endpoint.
Actual behavior
Telegraf only sends one sample of each time series in each batch of metrics. The prometheus remote write serialiser incorrectly assumes that, in any given batch of metrics, only the latest sample of each series should be sent. All older samples for a given metric are dropped, causing the majority of metric samples buffered during a remote write endpoint outage to be lost. The actual number of samples lost depends on the maximum number of metrics per batch and the number of unique time series per batch.
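To illustrate with assumed numbers (not taken from this report): with metric_batch_size = 1000 and 100 unique time series per batch, each serialised write carries at most 100 samples, so up to 900 of every 1000 buffered samples can be silently dropped.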
Additional info
The core logic causing the issue is located here:
https://github.com/influxdata/telegraf/blob/master/plugins/serializers/prometheusremotewrite/prometheusremotewrite.go#L198
The serialiser builds a map of time series where each metric key holds a timeseries object containing only a single sample. If a metric with a newer sample is found in the batch, it replaces the older one, and all older samples are silently lost.
I am working on a PR to correct this behaviour so that a given metric time series contains all samples in the batch, ordered chronologically.
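A minimal, self-contained sketch of the intended behaviour (simplified stand-in types and invented sample data, not the actual serializer or PR code):

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a simplified stand-in for prompb.Sample.
type sample struct {
	value     float64
	timestamp int64 // milliseconds
}

func main() {
	// Three samples of the same series, as would accumulate in the buffer
	// during a remote-write endpoint outage (values are invented).
	buffered := []struct {
		seriesKey string
		s         sample
	}{
		{`cpu_usage_idle{cpu="cpu-total"}`, sample{98.1, 1000}},
		{`cpu_usage_idle{cpu="cpu-total"}`, sample{97.4, 2000}},
		{`cpu_usage_idle{cpu="cpu-total"}`, sample{96.9, 3000}},
	}

	// Current behaviour: one sample per series key; older samples are overwritten.
	latestOnly := map[string]sample{}
	for _, m := range buffered {
		if cur, ok := latestOnly[m.seriesKey]; !ok || m.s.timestamp > cur.timestamp {
			latestOnly[m.seriesKey] = m.s
		}
	}

	// Proposed behaviour: keep every sample of every series, ordered chronologically.
	allSamples := map[string][]sample{}
	for _, m := range buffered {
		allSamples[m.seriesKey] = append(allSamples[m.seriesKey], m.s)
	}
	for key := range allSamples {
		samples := allSamples[key]
		sort.Slice(samples, func(i, j int) bool {
			return samples[i].timestamp < samples[j].timestamp
		})
	}

	key := buffered[0].seriesKey
	fmt.Println("samples kept (latest-only grouping):", len(latestOnly))      // 1: only the newest sample survives
	fmt.Println("samples kept (all samples, sorted): ", len(allSamples[key])) // 3
}
```

Running the sketch keeps all three buffered samples of the series for serialisation, whereas the latest-only grouping keeps just one.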