Receive drops all incoming data if clock issues #6167

Closed
edgrz opened this issue Feb 27, 2023 · 7 comments

edgrz commented Feb 27, 2023

Thanos, Prometheus and Golang version used:
Thanos version:

thanos, version 0.30.2 (branch: HEAD, revision: fe3f5d24192570038e9576307e1b31794920a1f3)
  build user:       root@6213df2115a5
  build date:       20230131-20:03:09
  go version:       go1.19.5
  platform:         linux/amd64

Prometheus using remote_write:

prometheus, version 2.39.1 (branch: HEAD, revision: dcd6af9e0d56165c6f5c64ebbc1fae798d24933a)
  build user:       root@273d60c69592
  build date:       20221007-15:57:09
  go version:       go1.19.2
  platform:         linux/amd64

Object Storage Provider: s3

What happened:
As discussed in the closed issue referenced below, Thanos Receive stops processing incoming data correctly whenever it receives a data series which is ahead in time: it starts dropping all metrics from any Prometheus instance using remote_write.

The behavior is exactly what @mtlang described in #3765 (comment).

From my perspective, it looks like if any source ingesting data into Thanos Receive happens to send a metric with a wrong (future) timestamp, Thanos stops considering the current time as the "proper" one and starts dropping all metrics, logging:

level=warn ts=2023-02-27T12:54:57.827885697Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1663
level=warn ts=2023-02-27T12:54:57.937124259Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1648

On the Prometheus side, we are receiving 409 Conflict responses.

What you expected to happen:
The data series which is ahead in time should be ignored by Thanos Receive, allowing the rest of the Prometheus instances, which have correct NTP time, to keep sending metrics.

How to reproduce it (as minimally and precisely as possible):

Configure 2 Prometheus instances (in my case in different k8s clusters) to send metrics using remote_write to a Thanos Receive in another cluster. Then, change the date on one of them:

sudo timedatectl set-ntp 0
sudo timedatectl set-time 'YYYY-MM-DD HH:MM:ss' # somewhere in the future

Full logs of relevant components:

Thanos receive:

level=warn ts=2023-02-27T12:54:57.370782169Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1671
level=warn ts=2023-02-27T12:54:57.465816704Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1654
level=warn ts=2023-02-27T12:54:57.54173344Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1607
level=warn ts=2023-02-27T12:54:57.634812264Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1645
level=warn ts=2023-02-27T12:54:57.721574977Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1619
level=warn ts=2023-02-27T12:54:57.827885697Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1663
level=warn ts=2023-02-27T12:54:57.937124259Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1648
level=warn ts=2023-02-27T12:55:03.020531616Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=996
level=warn ts=2023-02-27T12:55:08.071377828Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=600

Prometheus clients:

ts=2023-02-27T13:26:29.117Z caller=dedupe.go:112 component=remote level=error remote_name=4ddd6d url=https://thanos-receiver.edgar-270222.staging.anywhere.navify.com/api/v1/receive msg="non-recoverable error" count=5000 exemplarCount=0 err="server returned HTTP status 409 Conflict: 3 errors: forwarding request to endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1648 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1682 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1670 samples: out of bounds"
ts=2023-02-27T13:26:34.180Z caller=dedupe.go:112 component=remote level=error remote_name=4ddd6d url=https://thanos-receiver.edgar-270222.staging.anywhere.navify.com/api/v1/receive msg="non-recoverable error" count=3238 exemplarCount=0 err="server returned HTTP status 409 Conflict: 3 errors: forwarding request to endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1072 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1105 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1061 samples: out of bounds"
ts=2023-02-27T13:26:39.223Z caller=dedupe.go:112 component=remote level=error remote_name=4ddd6d url=https://thanos-receiver.edgar-270222.staging.anywhere.navify.com/api/v1/receive msg="non-recoverable error" count=1451 exemplarCount=0 err="server returned HTTP status 409 Conflict: 3 errors: forwarding request to endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 524 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 469 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 458 samples: out of bounds"

fpetkovski commented Feb 27, 2023

I am not sure if we can easily solve it because it's a Prometheus TSDB behavior. Enabling out-of-order ingestion could help, but it's still an experimental feature that hasn't been battle tested.


defreng commented Feb 27, 2023

@fpetkovski

Wouldn't it be fairly easy for Thanos to just drop incoming samples that are more than XX (configurable?) seconds in the future according to the server time of thanos-receive?
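
For illustration, here is a minimal sketch in Go of the kind of guard being suggested above. This is not the actual Thanos Receive code: the Sample type, the dropFutureSamples helper, and the 30-second tolerance are assumptions made up for the example.

package main

import (
	"fmt"
	"time"
)

// Sample mirrors the minimal shape of a remote-write sample: a millisecond
// timestamp plus a float value.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// dropFutureSamples filters out samples whose timestamps are more than
// `tolerance` ahead of the receiver's wall clock, instead of failing the
// whole write request.
func dropFutureSamples(samples []Sample, now time.Time, tolerance time.Duration) (kept []Sample, dropped int) {
	upperBound := now.Add(tolerance).UnixMilli()
	for _, s := range samples {
		if s.TimestampMs > upperBound {
			dropped++ // too far in the future: skip this sample, keep the rest
			continue
		}
		kept = append(kept, s)
	}
	return kept, dropped
}

func main() {
	now := time.Now()
	samples := []Sample{
		{TimestampMs: now.UnixMilli(), Value: 1},                     // current sample, kept
		{TimestampMs: now.Add(72 * time.Hour).UnixMilli(), Value: 2}, // ~3 days ahead, dropped
	}
	kept, dropped := dropFutureSamples(samples, now, 30*time.Second)
	fmt.Printf("kept=%d dropped=%d\n", len(kept), dropped)
}

The point of such a guard would be to reject the offending samples individually rather than failing the whole remote_write request, so one client with a broken clock cannot block ingestion for everyone else.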


jnyi commented Mar 8, 2023

+1 on this issue. We encountered this today: the TSDB head is at March 12, 2023 even though I am commenting on March 8. Any ideas @fpetkovski?

[Screenshots: thanos-future-head, thanos-future-time]


jnyi commented Mar 8, 2023

So we enabled the experimental flag "--tsdb.out-of-order.time-window=15m", which works fine for timestamps that are too old (within the 2h window). However, upstream Prometheus does not have a way to protect itself from appending a future timestamp: https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L401
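
To make the missing upper bound concrete, here is a simplified, hypothetical sketch in Go. It is not the actual head_append.go logic; validateTimestamp, headMaxTimeMs, and the 15m window are illustrative stand-ins for the check described above, which only enforces a lower bound derived from the head's max time and the out-of-order window.

package main

import (
	"errors"
	"fmt"
	"time"
)

var errTooOld = errors.New("out of bounds: sample too old")

// validateTimestamp mimics a lower-bound-only check: minValidTime is derived
// from the head's current max time minus the out-of-order window. There is no
// corresponding upper bound, so a far-future sample is accepted and drags the
// head's max time forward.
func validateTimestamp(tsMs, headMaxTimeMs int64, oooWindow time.Duration) error {
	minValidTime := headMaxTimeMs - oooWindow.Milliseconds()
	if tsMs < minValidTime {
		return errTooOld
	}
	return nil // no comparison against "now": future samples pass
}

func main() {
	now := time.Now().UnixMilli()
	future := time.Now().Add(96 * time.Hour).UnixMilli() // a client with a broken clock

	// The far-future sample is accepted...
	fmt.Println(validateTimestamp(future, now, 15*time.Minute)) // <nil>
	// ...and once the head's max time has jumped to `future`, samples from
	// correctly-clocked clients fall below minValidTime and fail.
	fmt.Println(validateTimestamp(now, future, 15*time.Minute)) // out of bounds: sample too old
}

This is why a single far-future sample can poison ingestion: it advances the head's max time, and every correctly-clocked client then falls below the lower bound and gets the out-of-bounds / 409 Conflict errors shown earlier in this issue.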

fpetkovski commented

@defreng What you suggested makes sense to me. Having some sort of an upper bound on the timestamp should at least prevent big losses of data.


jnyi commented Mar 8, 2023

Hi guys, I put up a PR to address this issue. Since it is my first time contributing, I would welcome some early feedback (I will add a changelog entry if it is good to go).

GiedriusS commented

There is now a flag to avoid ingesting samples that are too far into the future, hence closing this.
