receive: Stops storing data #3765
Comments
We are starting to see this issue on several systems now. Here is a partial receiver log from the latest incident. It looks like, after the second WAL checkpoint, all metric data is being thrown away.
Thanks for the report! First of all, you are running a very old receive version. There have been tons of new features, improvements, and bugfixes added without changing the API, so before we do anything else, please upgrade first.
Actually, it looks like the problem is that the incident took more than 2h; that's why, when restarted, you got:
The question is: why did you get into this situation in the first place? (:
All the logs you provided are for the situation after the incident. It's expected that some of the older data will be skipped, because it touched timestamps which were already assumed to be immutable (a block was produced). Let's focus on what we want to fix in this issue 🤗 Do we want to be able to consume older data? Let's create an issue for that then (I think we had some).
We used the observability configuration where there are 3 receiver instances configured with the hash ring. All of the receivers suffered this issue.
Which issue? Starting up after being down for 18h and ignoring incoming write requests until they have newer timestamps? That's expected. (:
Sorry if I was not clear; the receivers stayed up the whole time, they just refused to store any data. This caused an 18-hour gap in the metric graphs. After 18 hours, they suddenly started working again. The most recent log shows a receiver that was deployed on a newly installed system. The pod stayed up the whole time, but after a while it stopped accepting metrics and said that everything was either too old or too far in the future.
We narrowed the problem down to the metrics from a single agent, and we are in the process of investigating why the metrics from this agent are causing all metric storage to stop in thanos receive.
We've determined that the problem is most likely being caused by an incorrect (future) UTC date being specified in the agent metrics. Date specified: "2021-02-05T04:55:19.000Z". The agent needs to fix this, but it should not be causing a complete metric storage outage in thanos receive.
We confirmed that the appearance of a future timestamp in the data was causing thanos receive to stop processing all other incoming metric data. The length of the outage seemed to be related to how far in the future the timestamp was. More information from Zach Shearin (a developer on my squad):
We've added a filter to our own metric collection logic to throw away metrics that specify a future time, so our immediate problem has been solved. But this seems like logic that should be in thanos receive.
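Something along these lines is roughly what such a guard looks like on the collection side (a minimal sketch, assuming the samples are held as Prometheus remote-write `prompb.TimeSeries`; the function name and tolerance are illustrative, not our exact code):

```go
package main

import (
	"time"

	"github.com/prometheus/prometheus/prompb"
)

// dropFutureSamples removes samples whose timestamp lies more than
// tolerance ahead of the local clock before the series are remote-written.
// prompb timestamps are milliseconds since the Unix epoch.
func dropFutureSamples(series []prompb.TimeSeries, tolerance time.Duration) []prompb.TimeSeries {
	cutoff := time.Now().Add(tolerance).UnixMilli()
	out := make([]prompb.TimeSeries, 0, len(series))
	for _, ts := range series {
		kept := ts.Samples[:0]
		for _, s := range ts.Samples {
			if s.Timestamp <= cutoff {
				kept = append(kept, s)
			}
		}
		// Drop series that end up with no samples at all.
		if len(kept) > 0 {
			ts.Samples = kept
			out = append(out, ts)
		}
	}
	return out
}
```

In practice this would run right before the write request is built, with a tolerance on the order of a minute.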
We met the same problem, I think. We lost 2 hours of data periodically; there is no data from 2021-02-21T22:00 to 2021-02-22T00:00. Here are the logs of the receive in that period.
FYI @bwplotka
Hello 👋 Looks like there was no activity on this issue for the last two months.
Any plan to prevent this from happening in thanos receive?
Hello 👋 Looks like there was no activity on this issue for the last two months.
As far as I know, this is still a problem in thanos receive. We worked around it by proactively throwing away metric data if the timestamp is a future date. We recently ran into a case where the timestamp was only a few seconds ahead of the server, so we may revisit this in our product (i.e., automatically reset the timestamp to the current time if it is only a few seconds ahead). A permanent solution really should be implemented in thanos receive.
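If we do revisit it, the change would look roughly like this (again just a sketch over `prompb` samples; `maxSkew` is an arbitrary knob, e.g. a few seconds): timestamps only slightly ahead of the local clock get clamped to "now", and anything further out is still dropped.

```go
package main

import (
	"time"

	"github.com/prometheus/prometheus/prompb"
)

// clampOrDropFuture resets timestamps that are only slightly ahead of the
// local clock to "now", and drops samples that are further in the future.
// maxSkew is how much clock skew we are willing to forgive.
func clampOrDropFuture(samples []prompb.Sample, maxSkew time.Duration) []prompb.Sample {
	now := time.Now().UnixMilli()
	limit := now + maxSkew.Milliseconds()
	kept := samples[:0]
	for _, s := range samples {
		switch {
		case s.Timestamp <= now:
			kept = append(kept, s) // past or present: keep as-is
		case s.Timestamp <= limit:
			s.Timestamp = now // small skew: rewrite to the local clock
			kept = append(kept, s)
		default:
			// far in the future: still drop, so it cannot block other series
		}
	}
	return kept
}
```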
We are confirming this, but it appears we are running into the same issue as you are @jfg1701a. I'll post back after we have definite confirmation.
We were able to confirm this. We submitted a datapoint in the future for one series, and from that point on we received 409 Conflict for any submission, to any series, with timestamps prior to that future point.
Member of Josh's team here. After some testing, I believe this is tied to when the receiver's current block ends. I tested on fresh receivers with 1-hour-old blocks. Sending data 50 minutes in the future (still in the current block) made Thanos return 409 errors when I sent that same metric with a current timestamp. I could still send data for other metrics, and after waiting 50 minutes, the original metric would work again. So far, this is expected behavior. However, when I sent data 70 minutes into the future (past the end of the current block), I would get a 409 error back if I tried to send current data for any metric.
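For anyone who wants to reproduce this, here is a minimal sketch of the kind of test I ran, using the plain remote-write protocol against receive (the endpoint address, metric names, and the 70-minute offset are placeholders; adjust them to your setup):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// push sends a single sample for the given metric name and timestamp
// (milliseconds) to a Thanos receive remote-write endpoint and returns
// the HTTP status code.
func push(endpoint, name string, tsMillis int64) (int, error) {
	wr := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels:  []prompb.Label{{Name: "__name__", Value: name}},
			Samples: []prompb.Sample{{Value: 1, Timestamp: tsMillis}},
		}},
	}
	raw, err := proto.Marshal(wr)
	if err != nil {
		return 0, err
	}
	body := snappy.Encode(nil, raw)

	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	endpoint := "http://localhost:19291/api/v1/receive" // placeholder receive address

	now := time.Now().UnixMilli()
	future := now + (70 * time.Minute).Milliseconds()

	// 1. Push a sample ~70 minutes in the future for one series.
	code, err := push(endpoint, "repro_future_metric", future)
	fmt.Println("future sample:", code, err)

	// 2. Push a current-time sample for a different series; in the scenario
	//    described above, this is where the 409 Conflict starts showing up.
	code, err = push(endpoint, "repro_other_metric", now)
	fmt.Println("current sample, other series:", code, err)
}
```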
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
I'm running into the same issue that @jfg1701a and @jzangari were mentioning. I have multiple Prometheus instances (different clusters) writing to the receivers. Receiver logs:
The Prometheus instances are receiving 409s, as reported:
Is there any plan to address this? Currently, if one of the sources has an issue, the rest of the sources are also affected, making Thanos not resilient to clock issues. Thanos version: v0.30.2
Ran into this issue too.
Hi, I am also facing this same issue. Do we have any solution for this? Thanks in advance.
hi @Prakashckesavan, I had a fix and it was linked to another issue: #6195
Thanos, Prometheus and Golang version used:
Object Storage Provider: MinIO
What happened: On one of our internal test systems, thanos receive stops processing incoming data at regular intervals. The following thanos ui query shows the metric node_cpu_seconds_total for the last week. This metric comes from a prometheus instance that is monitoring the OCP cluster:
As shown in the image, regular outages of 8 hours or more are occurring. The latest outage occurred on Feb 1, and lasted for 18 hours.
What you expected to happen: Thanos receive processes incoming metric data without error.
How to reproduce it (as minimally and precisely as possible): We are not sure what is causing it. Seems to occur periodically without any user intervention.
Full logs to relevant components:
Here's what the thanos receive log showed around the time that it resumed accepting metrics:
As shown in the log, thanos receive was basically throwing away everything, then suddenly started accepting metrics again. The following messages were observed two hours later:
Anything else we need to know:
We tried to resolve the problem initially by restarting the thanos receive and thanos receive controller pods, but it didn't help. We also tried restarting the memcached and store pods, but that had no effect. We then decided to leave the system as-is overnight, and found this morning that it had started to work again.