receive: Stops storing data #3765

Closed

jfg1701a opened this issue Feb 2, 2021 · 26 comments

jfg1701a commented Feb 2, 2021

Thanos, Prometheus and Golang version used:

thanos, version 0.14.0 (branch: master, revision: 70f89d837eebd672926663dd8876035860511f06)
  build user:       circleci@a770acd66205
  build date:       20200812-10:39:09
  go version:       go1.14.2

Object Storage Provider: MinIO

What happened: On one of our internal test systems, thanos receive stops processing incoming data at regular intervals. The following thanos UI query shows the metric node_cpu_seconds_total for the last week. This metric comes from a prometheus instance that is monitoring the OCP cluster:

[image: thanos UI graph of node_cpu_seconds_total over the last week, showing recurring gaps]

As shown in the image, regular outages of 8 hours or more are occurring. The latest outage occurred on Feb 1, and lasted for 18 hours.

What you expected to happen: Thanos receive processes incoming metric data without error.

How to reproduce it (as minimally and precisely as possible): We are not sure what is causing it. Seems to occur periodically without any user intervention.

Full logs to relevant components:

Here's what the thanos receive log showed around the time that it resumed accepting metrics:

Logs

level=warn ts=2021-02-01T23:28:47.88926126Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3384
level=warn ts=2021-02-01T23:28:47.895368012Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3313
level=warn ts=2021-02-01T23:29:46.425728194Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:46.427602034Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:46.428611741Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:48.662364042Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:48.668642017Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:48.669927944Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:51.384631506Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:51.38742069Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=warn ts=2021-02-01T23:29:51.397757579Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
level=warn ts=2021-02-01T23:29:56.5780011Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2

As shown in the log, thanos receive was basically throwing away everything, then suddenly started accepting metrics again. The following messages were observed two hours later:

Logs

level=info ts=2021-02-02T02:29:39.78375504Z caller=compact.go:494 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="write block" mint=1612222126000 maxt=1612224000000 ulid=01EXG95J4TWA4G17SYES7GNJKG duration=11.437024317s
level=info ts=2021-02-02T02:29:40.104937574Z caller=head.go:807 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="Head GC completed" duration=223.866704ms
level=info ts=2021-02-02T02:29:40.753078185Z caller=checkpoint.go:96 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="Creating checkpoint" from_segment=1118 to_segment=1120 mint=1612224000000
level=info ts=2021-02-02T02:29:41.063104019Z caller=head.go:887 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="WAL checkpoint complete" first=1118 last=1120 duration=311.134619ms
level=info ts=2021-02-02T02:30:02.423330728Z caller=shipper.go:333 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="upload new block" id=01EXG95J4TWA4G17SYES7GNJKG
level=info ts=2021-02-02T03:01:10.543246542Z caller=compact.go:494 component=receive component=multi-tsdb tenant=a02d6835-208b-446e-86b3-dfdbc5ca849a msg="write block" mint=1612224000000 maxt=1612231200000 ulid=01EXGAZ83MJGJCYRET8PF2E3E8 duration=11.930235283s

Anything else we need to know:

We initially tried to resolve the problem by restarting the thanos receive and thanos receive controller pods, but it didn't help. We also tried restarting the memcached and store pods, but that had no effect. We then decided to leave the system as-is overnight, and found this morning that it had started to work again.

jfg1701a (Author) commented Feb 3, 2021

We are starting to see this issue on several systems now. Here is a partial receiver log from the latest incident:

receiver-log.txt

It looks like, after the second WAL checkpoint, all metric data is being thrown away.

bwplotka (Member) commented Feb 4, 2021

Thanks for the report!

First of all, you are running a very old receive version. There have been tons of new features, improvements, and bugfixes added without changing the API, so before we do anything else, just upgrade to v0.18.0 🤗 We use TSDB (Prometheus) code underneath, so we have had many upgrades since then. There is little point in diving too deep into this problem before upgrading.

bwplotka (Member) commented Feb 4, 2021

Actually, it looks like the problem is that the incident took more than 2h; that's why, when restarted, you got:

level=warn ts=2021-02-01T23:29:56.5780011Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2

bwplotka (Member) commented Feb 4, 2021

The question is: why did you get into this state in the first place? (:

bwplotka (Member) commented Feb 4, 2021

All the logs you provided are for the situation after the incident. It's expected that some of the older data will be skipped, because it touches timestamps that were already assumed to be immutable (a block was already produced for them).

Let's focus on what we want to fix in this issue 🤗

  • Do we want to be able to consume older data? Let's create an issue for that then (I think we had some).
  • Do we want to avoid receive being down for 8-18h? Definitely, but this is something you need to ensure with e.g. k8s automation and replication.

jfg1701a (Author) commented Feb 4, 2021

We used the observability configuration where there are 3 receiver instances configured with the hash ring. All of the receivers suffered this issue.

bwplotka (Member) commented Feb 4, 2021

Which issue? Starting up after being down for 18h and ignoring incoming write requests until they have newer timestamps? That's expected. (:

jfg1701a (Author) commented Feb 4, 2021

Sorry if I was not clear; the receivers stayed up the whole time, they just refused to store any data. This caused an 18-hour gap in the metric graphs. After 18 hours, they suddenly started working again.

The most recent log shows a receiver that was deployed on a newly installed system. The pod stayed up the whole time, but after a while it stopped accepting metrics and reported that everything was either too old or too far into the future.

jfg1701a (Author) commented Feb 4, 2021

We narrowed the problem down to the metrics from a single agent, and we are in the process of investigating why the metrics from this agent are causing all metric storage to stop in thanos receive.

jfg1701a (Author) commented Feb 4, 2021

We've determined that the problem is most likely being caused by an incorrect (future) UTC date being specified in the agent metrics.

Date specified: "2021-02-05T04:55:19.000Z"
Server date: Thu Feb 4 18:34:17 UTC 2021

That puts the agent's timestamps roughly 10 hours 21 minutes in the future. The agent needs to fix this, but it should not be causing a complete metric storage outage in thanos receive.

jfg1701a (Author) commented Feb 5, 2021

We confirmed that the appearance of a future timestamp in the data was causing thanos receive to stop processing all other incoming metric data. The length of the outage seemed to be related to how far in the future the timestamp was.

More information from Zach Shearin (a developer on my squad):

Okay, so the root of this problem was that a connected agent was sending data from the future (it looks to be about 10 hours in the future - guessing they are making a mistake with local vs. UTC timestamps).

The weird thing is that when this single agent is connected, it brings down all data collection. The only thing I could think of that could potentially cause that is this:

  • Is the thanos receive component's notion of time relative to the latest timestamp that it has received?
  • If so, then once the "future" data was received, "current data" with accurate timestamps coming in would be rendered as "too old" (even though it's current data, it's old relative to 10 hours from now).

Is this theory accurate or off base?
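For reference, this theory lines up with how the Prometheus TSDB head appender bounds timestamps: every accepted sample can push the head's max time forward, and samples older than the max time minus half the block range are rejected as out of bounds. Below is a minimal sketch of that mechanism in Go, assuming the default 2h block range; the head type and append method here are simplified illustrations, not the actual Prometheus code.

package main

import (
	"fmt"
	"time"
)

// blockRangeMs mirrors the default 2h TSDB block range, in milliseconds.
const blockRangeMs = 2 * 60 * 60 * 1000

// head is a simplified stand-in for the TSDB head block: it only tracks
// the maximum timestamp it has ever accepted.
type head struct {
	maxTimeMs int64
}

// append accepts a sample timestamp or rejects it as out of bounds:
// anything older than maxTime - blockRange/2 is refused, and every
// accepted sample can push maxTime (and therefore the lower bound) forward.
func (h *head) append(tsMs int64) error {
	minValid := h.maxTimeMs - blockRangeMs/2
	if tsMs < minValid {
		return fmt.Errorf("out of bounds: sample %d < min valid %d", tsMs, minValid)
	}
	if tsMs > h.maxTimeMs {
		h.maxTimeMs = tsMs
	}
	return nil
}

func main() {
	now := time.Now().UnixMilli()
	h := &head{maxTimeMs: now}

	// A single sample 10h in the future moves maxTime forward...
	_ = h.append(now + 10*60*60*1000)

	// ...and from then on, samples with accurate current timestamps are
	// rejected, because they fall below maxTime - 1h.
	if err := h.append(now); err != nil {
		fmt.Println("current sample rejected:", err)
	}
}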

jfg1701a (Author) commented Feb 5, 2021

We've added a filter to our own metric collection logic to throw away metrics that specify a future time, so our immediate problem has been solved. But this seems like logic that should be in thanos receive.
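For reference, a minimal sketch of that kind of client-side filter: samples whose timestamps are more than a small tolerance ahead of the local clock are dropped before being forwarded. The sample type, function name, and the 30s tolerance are illustrative assumptions, not part of any Thanos or Prometheus API.

package main

import (
	"fmt"
	"time"
)

// sample is an illustrative stand-in for a remote-write sample.
type sample struct {
	TimestampMs int64
	Value       float64
}

// dropFutureSamples keeps only samples whose timestamps are at most
// tolerance ahead of the local clock, so one agent with a badly skewed
// clock cannot push future timestamps toward the receiver.
func dropFutureSamples(samples []sample, tolerance time.Duration) []sample {
	cutoff := time.Now().Add(tolerance).UnixMilli()
	kept := make([]sample, 0, len(samples))
	for _, s := range samples {
		if s.TimestampMs <= cutoff {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	now := time.Now().UnixMilli()
	in := []sample{
		{TimestampMs: now, Value: 1},                 // current: kept
		{TimestampMs: now + 10*60*60*1000, Value: 2}, // ~10h ahead: dropped
	}
	fmt.Println(len(dropFutureSamples(in, 30*time.Second))) // prints 1
}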

kakkoyun changed the title from "Thanos receive stops storing data" to "receive: Stops storing data" on Feb 12, 2021
clyang82 (Contributor) commented:

We met the same problem, I think. We lost 2 hours of data periodically. From thanos tools bucket web we can see that lines 4, 6, and 9 are for the current tenant ID, and that the data is discontinuous.

[image: thanos tools bucket web view showing discontinuous blocks for the tenant]

The thanos version we are using is 0.15, and we are using AWS S3 as object storage.

There is no data from 2021-02-21T22:00 to 2021-02-22T00:00. Here are the receive logs for that period:

level=info ts=2021-02-21T23:00:15.466857082Z caller=compact.go:494 component=receive component=multi-tsdb tenant=3724ea92-4184-4e9f-8fe1-db5ebe7270d0 msg="write block" mint=1613937600000 maxt=1613944800000 ulid=01EZ3D4JCCFHAGK2GGKB78789X duration=9.31019266s
level=info ts=2021-02-21T23:00:16.164084347Z caller=head.go:807 component=receive component=multi-tsdb tenant=3724ea92-4184-4e9f-8fe1-db5ebe7270d0 msg="Head GC completed" duration=309.722829ms
level=info ts=2021-02-21T23:00:16.193272494Z caller=checkpoint.go:96 component=receive component=multi-tsdb tenant=3724ea92-4184-4e9f-8fe1-db5ebe7270d0 msg="Creating checkpoint" from_segment=200 to_segment=329 mint=1613944800000
level=info ts=2021-02-21T23:00:17.739929183Z caller=shipper.go:333 component=receive component=multi-tsdb tenant=3724ea92-4184-4e9f-8fe1-db5ebe7270d0 msg="upload new block" id=01EZ3D4JCCFHAGK2GGKB78789X
level=error ts=2021-02-21T23:00:21.400361965Z caller=db.go:732 component=receive component=multi-tsdb tenant=3724ea92-4184-4e9f-8fe1-db5ebe7270d0 msg="compaction failed" err="reload blocks: head truncate failed: create checkpoint: read segments: corruption in segment /var/thanos/receive/3724ea92-4184-4e9f-8fe1-db5ebe7270d0/wal/00000200 at 100892672: unexpected checksum 933ca57c, expected b1c18308"

FYI @bwplotka

stale bot commented Jun 3, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Jun 3, 2021

jfg1701a (Author) commented Jun 4, 2021

Any plan to prevent this from happening in thanos receive?

stale bot removed the stale label Jun 4, 2021

stale bot commented Aug 3, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Aug 3, 2021

jfg1701a (Author) commented Aug 4, 2021

As far as I know, this is still a problem in thanos receive. We worked around it by proactively throwing away metric data if the timestamp is in the future. We recently ran into a case where the timestamp was only a few seconds ahead of the server, so we may revisit this in our product (i.e., automatically reset the timestamp to the current time if it is only a few seconds ahead).

A permanent solution really should be implemented in thanos receive.
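A sketch of that clamping variant, under the same illustrative assumptions as the filter sketch earlier in the thread (the sample type and thresholds are made up for the example, not an existing API): timestamps only a few seconds ahead of the server are reset to the current time, while anything further ahead is still dropped.

package main

import (
	"fmt"
	"time"
)

// sample is an illustrative stand-in for a remote-write sample.
type sample struct {
	TimestampMs int64
	Value       float64
}

// clampSlightlyFuture resets timestamps that are at most maxSkew ahead of
// the local clock back to "now", and drops anything further ahead.
func clampSlightlyFuture(samples []sample, maxSkew time.Duration) []sample {
	nowMs := time.Now().UnixMilli()
	skewMs := maxSkew.Milliseconds()
	out := make([]sample, 0, len(samples))
	for _, s := range samples {
		switch {
		case s.TimestampMs <= nowMs:
			out = append(out, s) // current or past: keep as-is
		case s.TimestampMs-nowMs <= skewMs:
			s.TimestampMs = nowMs // a few seconds ahead: clamp to now
			out = append(out, s)
		default:
			// far in the future: drop entirely
		}
	}
	return out
}

func main() {
	now := time.Now().UnixMilli()
	in := []sample{
		{TimestampMs: now + 2000, Value: 1},          // 2s ahead: clamped to now
		{TimestampMs: now + 10*60*60*1000, Value: 2}, // ~10h ahead: dropped
	}
	fmt.Println(len(clampSlightlyFuture(in, 5*time.Second))) // prints 1
}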

stale bot removed the stale label Aug 4, 2021

jzangari commented:
We are confirming this, but it appears we are running into the same issue as you are @jfg1701a. I'll post back after we have definite confirmation.

jzangari commented:
We were able to confirm this. We submitted a data point in the future for one series, and from that point on we received 409 Conflict for any submission, to any series, with a timestamp earlier than that future point.

mtlang (Contributor) commented Aug 26, 2021

Member of Josh's team here. After some testing, I believe this is tied to when the receiver's current block ends. I tested on fresh receivers with 1-hour-old blocks. Sending data 50 minutes in the future (still within the current block) made Thanos return 409 errors when I sent that same metric with a current timestamp. I could still send data for other metrics, and after waiting 50 minutes, the original metric would work again. So far, this is expected behavior. However, when I sent data 70 minutes into the future (past the current block), I would get a 409 error back if I tried to send current data for any metric.
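These observations are consistent with the bound sketched earlier in the thread, assuming a 2h block range and a resulting 1h out-of-bounds window (an assumption about the TSDB internals, not a confirmed implementation detail). A quick check of the arithmetic for the two cases:

package main

import "fmt"

// currentSampleAccepted reports whether a sample with a current timestamp
// would still be in bounds after a sample aheadMin minutes in the future
// has been ingested, assuming samples older than maxTime minus 60 minutes
// (half of a 2h block range) are rejected. This mirrors the simplified
// model used earlier in the thread, not the literal Prometheus code.
func currentSampleAccepted(aheadMin int) bool {
	const windowMin = 60
	minValidOffsetMin := aheadMin - windowMin // minutes relative to "now"
	return minValidOffsetMin <= 0
}

func main() {
	for _, ahead := range []int{50, 70} {
		fmt.Printf("sample %d min ahead -> current data still accepted: %v\n",
			ahead, currentSampleAccepted(ahead))
	}
	// 50 min ahead: min valid time becomes now-10min, so current samples for
	// other series stay in bounds (the skewed series itself only sees
	// out-of-order rejections).
	// 70 min ahead: min valid time becomes now+10min, so current samples for
	// every series are out of bounds, matching the 409s described above.
}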

stale bot commented Oct 30, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Oct 30, 2021

stale bot commented Jan 9, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale bot closed this as completed Jan 9, 2022

edgrz commented Feb 27, 2023

I'm running into the same issue that @jfg1701a and @jzangari described.

I have multiple prometheus instances (in different clusters) using the remote_write mechanism to forward their metrics to a shared receiver. If one of the clusters has a clock issue and is sending metrics from the future, the receivers start to complain and drop all incoming metrics, regardless of the source cluster (each of them has different externalLabels).

Receiver logs:

level=warn ts=2023-02-27T12:54:57.827885697Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1663
level=warn ts=2023-02-27T12:54:57.937124259Z caller=writer.go:188 component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples that are too old or are too far into the future" numDropped=1648

The prometheus instances are receiving 409s, as reported:

ts=2023-02-27T13:26:39.223Z caller=dedupe.go:112 component=remote level=error remote_name=4ddd6d url=https://thanos-receiver.edgar-270222.staging.anywhere.navify.com/api/v1/receive msg="non-recoverable error" count=1451 exemplarCount=0 err="server returned HTTP status 409 Conflict: 3 errors: forwarding request to endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-0.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 524 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-1.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 469 samples: out of bounds; forwarding request to endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-ingestor-default-2.thanos-receive-ingestor-default.thanos.svc.cluster.local:10901: add 458 samples: out of bounds"

Is there any plan for the thanos receiver to implement logic to prevent such issues from happening?

Currently, if one of the sources has an issue, the rest of the sources are also affected, which makes Thanos not resilient to clock issues.

Thanos version: v0.30.2

jnyi (Contributor) commented Mar 8, 2023

Ran into this issue too: avg(prometheus_tsdb_head_max_time > time() * 1000) will be a future time, which basically blocks all current traffic. We need a way to prevent clock-skewed samples from becoming the head.

Prakashckesavan commented:

Hi, I am also facing this same issue. Do we have any solution for this? Thanks in advance.

jnyi (Contributor) commented May 21, 2024

Hi @Prakashckesavan, I had a fix, and it was linked to another issue: #6195
