Receive drops all incoming data if clock issues #6167

Closed
edgrz opened this issue Feb 27, 2023 · 7 comments

edgrz commented Feb 27, 2023

Thanos, Prometheus and Golang version used:
Thanos version:

thanos, version 0.30.2 (branch: HEAD, revision: fe3f5d24192570038e9576307e1b31794920a1f3)
  build user:       root@6213df2115a5
  build date:       20230131-20:03:09
  go version:       go1.19.5
  platform:         linux/amd64

Prometheus using remote_write:

prometheus, version 2.39.1 (branch: HEAD, revision: dcd6af9e0d56165c6f5c64ebbc1fae798d24933a)
  build user:       root@273d60c69592
  build date:       20221007-15:57:09
  go version:       go1.19.2
  platform:         linux/amd64

Object Storage Provider: s3

What happened:
As discussed in the closed issue referenced below, Thanos Receive stops processing incoming data correctly whenever it receives a data series which is ahead in time: it starts dropping all metrics from any Prometheus instance using remote_write.

The behavior is exactly what @mtlang described in #3765 (comment).

From my perspective, it looks like if any source ingesting data into Thanos Receive happens to send a metric with a wrong (future) timestamp, Thanos stops considering the current time as the "proper" one and starts dropping all metrics, logging:

level=warn ts=2023-02-27T12:54:57.827885697Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1663
level=warn ts=2023-02-27T12:54:57.937124259Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1648

On the Prometheus side, we are receiving 409 Conflict responses.

What you expected to happen:
The data series which is ahead in time should be ignored by Thanos Receive, allowing the rest of the Prometheus instances, which have correct NTP time, to keep sending metrics.

How to reproduce it (as minimally and precisely as possible):

Configure 2 Prometheus instances (in my case in different k8s clusters) to send metrics using remote_write to a Thanos Receive in another cluster. Then, change the date on one of them:

sudo timedatectl set-ntp 0
sudo timedatectl set-time 'YYYY-MM-DD HH:MM:ss' # somewhere in the future

Full logs of relevant components:

Thanos receive:

level=warn ts=2023-02-27T12:54:57.370782169Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1671
level=warn ts=2023-02-27T12:54:57.465816704Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1654
level=warn ts=2023-02-27T12:54:57.54173344Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1607
level=warn ts=2023-02-27T12:54:57.634812264Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1645
level=warn ts=2023-02-27T12:54:57.721574977Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1619
level=warn ts=2023-02-27T12:54:57.827885697Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1663
level=warn ts=2023-02-27T12:54:57.937124259Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1648
level=warn ts=2023-02-27T12:55:03.020531616Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=996
level=warn ts=2023-02-27T12:55:08.071377828Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=600

Prometheus clients:

ts=2023-02-27T13:26:29.117Z caller=dedupe.go:112 component=remote level=error remote_name=4ddd6d url=https://thanos-receiver.edgar-270222.staging.anywhere.navify.com/api/v1/receive msg="non-recoverable error" count=5000 exemplarCount=0 err="server returned HTTP status 409 Conflict: 3 errors: forwarding request to endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1648 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1682 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1670 samples: out of bounds"
ts=2023-02-27T13:26:34.180Z caller=dedupe.go:112 component=remote level=error remote_name=4ddd6d url=https://thanos-receiver.edgar-270222.staging.anywhere.navify.com/api/v1/receive msg="non-recoverable error" count=3238 exemplarCount=0 err="server returned HTTP status 409 Conflict: 3 errors: forwarding request to endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1072 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1105 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 1061 samples: out of bounds"
ts=2023-02-27T13:26:39.223Z caller=dedupe.go:112 component=remote level=error remote_name=4ddd6d url=https://thanos-receiver.edgar-270222.staging.anywhere.navify.com/api/v1/receive msg="non-recoverable error" count=1451 exemplarCount=0 err="server returned HTTP status 409 Conflict: 3 errors: forwarding request to endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 524 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 469 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 458 samples: out of bounds"

fpetkovski commented Feb 27, 2023

I am not sure if we can easily solve it because it's a Prometheus TSDB behavior. Enabling out-of-order ingestion could help, but it's still an experimental feature that hasn't been battle tested.


defreng commented Feb 27, 2023

@fpetkovski

Wouldn't it be fairly easy for Thanos to just drop incoming samples that are more than XX (configurable?) seconds in the future according to the server time of thanos-receive?
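
For illustration, here is a minimal sketch in Go of the kind of guard being suggested above. This is not the actual Thanos Receive code: the Sample type, the dropFutureSamples helper, and the 30-second tolerance are assumptions made up for the example.

package main

import (
	"fmt"
	"time"
)

// Sample mirrors the minimal shape of a remote-write sample: a millisecond
// timestamp plus a float value.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// dropFutureSamples filters out samples whose timestamps are more than
// `tolerance` ahead of the receiver's wall clock, instead of failing the
// whole write request.
func dropFutureSamples(samples []Sample, now time.Time, tolerance time.Duration) (kept []Sample, dropped int) {
	upperBound := now.Add(tolerance).UnixMilli()
	for _, s := range samples {
		if s.TimestampMs > upperBound {
			dropped++ // too far in the future: skip this sample, keep the rest
			continue
		}
		kept = append(kept, s)
	}
	return kept, dropped
}

func main() {
	now := time.Now()
	samples := []Sample{
		{TimestampMs: now.UnixMilli(), Value: 1},                     // current sample, kept
		{TimestampMs: now.Add(72 * time.Hour).UnixMilli(), Value: 2}, // ~3 days ahead, dropped
	}
	kept, dropped := dropFutureSamples(samples, now, 30*time.Second)
	fmt.Printf("kept=%d dropped=%d\n", len(kept), dropped)
}

The point of such a guard would be to reject the offending samples individually rather than failing the whole remote_write request, so one client with a broken clock cannot block ingestion for everyone else.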


jnyi commented Mar 8, 2023

+1 on this issue. We encountered this today: the TSDB head is at March 12, 2023 even though I am commenting on March 8. Any ideas @fpetkovski?

[Screenshots: thanos-future-head, thanos-future-time]


jnyi commented Mar 8, 2023

So we enabled the experimental flag "--tsdb.out-of-order.time-window=15m", which works fine for timestamps that are too old (within the 2h window). However, upstream Prometheus does not have a way to protect itself from appending a future timestamp: https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L401
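
To make the missing upper bound concrete, here is a simplified, hypothetical sketch in Go. It is not the actual head_append.go logic; validateTimestamp, headMaxTimeMs, and the 15m window are illustrative stand-ins for the check described above, which only enforces a lower bound derived from the head's max time and the out-of-order window.

package main

import (
	"errors"
	"fmt"
	"time"
)

var errTooOld = errors.New("out of bounds: sample too old")

// validateTimestamp mimics a lower-bound-only check: minValidTime is derived
// from the head's current max time minus the out-of-order window. There is no
// corresponding upper bound, so a far-future sample is accepted and drags the
// head's max time forward.
func validateTimestamp(tsMs, headMaxTimeMs int64, oooWindow time.Duration) error {
	minValidTime := headMaxTimeMs - oooWindow.Milliseconds()
	if tsMs < minValidTime {
		return errTooOld
	}
	return nil // no comparison against "now": future samples pass
}

func main() {
	now := time.Now().UnixMilli()
	future := time.Now().Add(96 * time.Hour).UnixMilli() // a client with a broken clock

	// The far-future sample is accepted...
	fmt.Println(validateTimestamp(future, now, 15*time.Minute)) // <nil>
	// ...and once the head's max time has jumped to `future`, samples from
	// correctly-clocked clients fall below minValidTime and fail.
	fmt.Println(validateTimestamp(now, future, 15*time.Minute)) // out of bounds: sample too old
}

This is why a single far-future sample can poison ingestion: it advances the head's max time, and every correctly-clocked client then falls below the lower bound and gets the out-of-bounds / 409 Conflict errors shown earlier in this issue.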

fpetkovski commented

@defreng What you suggested makes sense to me. Having some sort of an upper bound on the timestamp should at least prevent big losses of data.


jnyi commented Mar 8, 2023

Hi guys, I put up a PR to address this issue. Since it is my first time contributing, I would welcome some early feedback (I will add a changelog entry if it is good to go).

GiedriusS commented

There is now a flag to avoid ingesting samples that are too far into the future, hence closing this.
