Thanos receive store locally for endpoint conflict #3913
Comments
Thanks @AsherBoone for raising this! I'm probably seeing the same issue on my clusters. It always happens when I roll the receiver StatefulSet. Error on the Prometheus side: Seeing this and several other errors on the Thanos Receive pod: Any help/hint is appreciated!
We see the same issue with our Receive and Prometheus setup, running prometheus:v2.26.0 and thanos:v0.20.1 respectively.
Same issue on Thanos version 0.21.1 (branch: HEAD, revision: 3558f4a).
We see a similar issue, which leaves a hole of hours of missing metrics. Thanos v0.19.0, Prometheus v2.27.1. The issue happens occasionally when we roll thanos-receive (replication factor 2, replicas 3). During the period of missing metrics, I see streams of errors with "conflict" and "HTTP status 500", which is interesting. Here is one example (with new lines inserted and endpoints shortened):
And similar errors in the Thanos log:
If I understand correctly, this means that both thanos-receive-1 and thanos-receive-0 already have the metric sent by Prometheus. Why would Thanos respond with status 500, causing Prometheus to retry?
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
Same issue in Thanos v0.23.
I think the only valid error here is what @starleaffff said (if you still have that). Actually, the message says exactly what happened: (...) Conflict: store locally (...) Meaning: I already have the data you've sent me, please store it locally and don't bother me with it. And the receiver does just that, unlike the sidecar, which serves the same data to the queriers and lets the queriers deduplicate. So, yeah, 409 is okay, though very disturbing to see in the Prometheus log. Maybe this should be documented somewhere? (Or it is and I haven't come across it.)
Thank you for your reply. Maybe this error actually makes no sense.
I too am running into these errors.
The same in v0.25. Can anyone help?
Seeing the same issue on v0.25.2 as well.
Seeing the same issue on v0.22.0 as well.
Seeing the same issue on v0.26.0 as well.
@sharathfeb12 What version of Prometheus are you running, and are you running it in agent mode?
I am running v2.30.1. Seeing the same issue on v2.36.1 as well.
@sharathfeb12 I am guessing you are running with a replication factor greater than 1? Out of interest, are you running the Thanos Router-Ingestor split or just a single-stage Receiver? Temporarily setting the replication factor to 1 seemed to solve the issues. I have created #5407 to track some of the debugging that I have been doing for this issue. My guess is that an incorrect status code is returned by Thanos, which causes Prometheus to keep retrying sending the same time series that Thanos already has. The reason setting the replication factor to 1 seems to help is that there is different error handling logic for it.
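For anyone trying the workaround above: the replication factor is controlled by the `--receive.replication-factor` flag on Thanos Receive. A minimal sketch of the relevant container args (the paths and endpoint name below are illustrative placeholders, not taken from any setup in this thread):

```yaml
# Illustrative Thanos Receive container args; only the flag names are real,
# the values are placeholders.
args:
  - receive
  - --tsdb.path=/var/thanos/receive
  - --remote-write.address=0.0.0.0:19291
  - --receive.hashrings-file=/etc/thanos/hashring.json
  - --receive.local-endpoint=thanos-receive-0.thanos-receive:10901
  # Temporarily dropping this from 2 to 1 switches to the single-replica
  # error-handling path mentioned above.
  - --receive.replication-factor=1
```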
I am running with a replication factor of 2, because our GKE clusters go through node pool upgrades very often and we do not want an outage when that happens. Due to the errors, the service teams think there is an issue on the server side and are not confident using the Thanos solution. I have also seen that the count goes down when we run with one Prometheus replica instead of running in HA.
I see this issue in Thanos 0.28.0 as well. I am running with replication factor = 1, but still see this occasionally. I have to turn off remote write on all Prometheus instances, then re-enable it. It would be great if someone could work on a fix for this issue.
@cybervedaa There are a couple of related issues, namely #5407; we're looking at these actively 👍
Thank you for the update, Matej.
Thanos and Prometheus version used:
Thanos: v0.18.0
Prometheus: v2.11.1
Object Storage Provider:
S3
What happened:
Prometheus errors in the logs:
hashring.json (configMap):
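For context, a hashring ConfigMap for a setup like this generally has the following shape (a minimal sketch; the endpoint names are hypothetical placeholders, not the actual configuration from this report):

```yaml
# Sketch of a hashring ConfigMap; the endpoints are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-hashring
data:
  hashring.json: |
    [
      {
        "hashring": "default",
        "endpoints": [
          "thanos-receive-0.thanos-receive:10901",
          "thanos-receive-1.thanos-receive:10901",
          "thanos-receive-2.thanos-receive:10901"
        ]
      }
    ]
```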
thanos receive:
Prometheus remote_write is pointed at the Receive endpoint http://10.53.26.191:30021/api/v1/receive (NodePort mode).
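The corresponding prometheus.yml remote_write block would look roughly like this (a sketch built around the endpoint above; no other settings are implied):

```yaml
# Sketch of the remote_write section pointing at the Thanos Receive
# NodePort endpoint; only the URL comes from the report above.
remote_write:
  - url: http://10.53.26.191:30021/api/v1/receive
```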
The Prometheus log has a lot of errors. I have tried modifying the Thanos Receive configuration many times, but the conflict still appears. Can anyone help me?