Receiver: Metrics with out-of-order labels get accepted #5499
Comments
Can Receive know if the labels are out of order without having to effectively copy, sort the copy, and then compare to the original? 🤔
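(For illustration: checking sortedness only needs a single pass over adjacent label names, so no copy or sort of the original slice is required. A minimal sketch; the `Label` struct here is a stand-in for the actual remote-write label type, not the real receiver code.)

```go
package main

import "fmt"

// Label is a stand-in for a remote-write label pair (name/value);
// the real receiver works with the protobuf label type.
type Label struct {
	Name, Value string
}

// labelsOutOfOrder reports whether the label set violates the
// remote-write requirement that labels be sorted by name. A single
// pass over adjacent pairs is enough; no copy or sort is needed.
func labelsOutOfOrder(lset []Label) bool {
	for i := 1; i < len(lset); i++ {
		if lset[i].Name < lset[i-1].Name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(labelsOutOfOrder([]Label{{"job", "a"}, {"instance", "b"}})) // true: "instance" sorts before "job"
	fmt.Println(labelsOutOfOrder([]Label{{"instance", "b"}, {"job", "a"}})) // false: already sorted
}
```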
After some more thought and taking a look at the remote write spec, I feel we should not be condoning clients' bad behavior, so instead of trying to fix wrong labels I'd suggest dropping series with invalid labels from the request and handling this as a bad request error. I gave it a stab in #5508.
@matej-g so we drop the series with bad labels, accept the others with good labels, and return a bad request? How does the Prometheus client react to a bad request? I am afraid it could send the same request again, duplicating the series with good labels.
Prometheus considers bad requests to be unrecoverable errors (unlike e.g. server errors), so it should not retry on such a response. This is equivalent to how we handle requests with invalid samples or exemplars. I tried to describe it in more depth in the PR description.
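(A rough sketch of that flow, reusing the `Label` stand-in and `labelsOutOfOrder` check from the sketch above; the `TimeSeries` type and the handler outline are illustrative, not the actual Thanos receiver code.)

```go
// TimeSeries is a simplified stand-in for a remote-write series.
type TimeSeries struct {
	Labels  []Label
	Samples []float64
}

// filterSeries splits a write request into series that can be ingested
// and a count of series dropped for having out-of-order labels.
func filterSeries(ts []TimeSeries) (valid []TimeSeries, dropped int) {
	for _, s := range ts {
		if labelsOutOfOrder(s.Labels) {
			dropped++
			continue
		}
		valid = append(valid, s)
	}
	return valid, dropped
}

// In the handler, the valid series are appended as usual; if anything was
// dropped, the response is 400 Bad Request, which Prometheus treats as
// unrecoverable, so it will not retry and re-send the good series.
```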
Awesome. It sounds like a good flow. Thanks for the explanation. 🙇
Uuh, Google does not like the suffix of this; I found a lot of issues about out-of-order samples, but this one about labels was buried :) Running 0.34 thanos-receivers, and it does work according to the above: it drops the requests :) I'm sending writes from Prometheus instances, however the remote_write config doesn't seem to have an option to send sorted labels, which is weird because the spec explicitly states that they should be sorted. It might make sense to provide the possibility to sort before egressing, or to sort inside the receiver before ingesting. Either way, is there a way I can explicitly sort? Or just keep the labelkeep blocks in order? @matej-g @douglascamata Thanks!
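(If sorting before ingest or egress were implemented, it would be a small normalization step. A hypothetical sketch using only the standard library; this is not current Thanos or Prometheus behavior, which today rejects such requests instead.)

```go
import "sort"

// sortLabels normalizes a label set in place so it satisfies the
// remote-write ordering requirement. Hypothetical pre-ingest (or
// pre-egress) step, not something the receiver does today.
func sortLabels(lset []Label) {
	sort.Slice(lset, func(i, j int) bool {
		return lset[i].Name < lset[j].Name
	})
}
```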
Oh, I don't think Prometheus should write out-of-order labels in remote write. Is there something else writing?
It's only Prometheuses, hundreds of them. Most of them are on version 2.50, with a couple still on 2.26; let me see if the old ones are causing the issue.
Can you run the receivers with debug logging? I think they should log the offending labels then, iirc.
Out of curiosity, how do the rejected labelsets look?
That lset also has a couple of duplicate labels after rule_group! That's really weird.
It is really weird, because checking the values, those are different; it's even a different pod, from a different node. One is RISC, the other is CISC. It's like two long label strings were randomly concatenated.
This is not sorted to begin with, though it duplicates again, even the node. Will debug tomorrow, but I think the logging might be wrong (we are currently consuming these with another layer of big Prometheus instances and did not notice any label duplication or weirdness like the above; of course those are scraping).
Thanos, Prometheus and Golang version used: Latest Thanos (0.27.0)
What happened:
Receiver happily accepts metrics with out-of-order labels. By itself, this seemed to be acceptable (i.e. TSDB will append them without complaining), but it causes issues in the compactor, which expects labels to be in order and gets into a crash loop.
This is similar to the situation that happened previously with prometheus/prometheus#5372. Though we're in remote write land and we may assume that in most cases we are getting metrics from 'official' clients that should adhere to the label ordering, this cannot be guaranteed for all clients (as is evidenced by our related issue #5497, which we caused by running some custom tests with an 'unofficial' client from https://github.com/observatorium/up).
What you expected to happen:
I would expect the receiver to handle this, at least in a minimal fashion, to prevent potential issues with the compactor down the line. Possible options: