Thanos Receive Ingester: In-memory representation of metrics distorted during instability. #6265
Quick update here - this is likely related to a combination of Thanos Receive crashing due to runtime panics and memory snapshotting being enabled on the Prometheus TSDB. A similar issue was raised here: Although that issue was addressed in Prometheus 2.32.0 (Prometheus Go module version 0.32.0), we're running Thanos 0.30.2, which uses Prometheus Go module version 0.40.7, so it could be that there are further bugs with memory snapshotting. From the logs we can see there were runtime panics that were recovered:
However, looking at the Kubernetes pod state, we see that the pod exited with panics as well:
So my suspicion is that there are two distinct issues: one that causes an unrecoverable panic in Thanos Receive, and a secondary issue that causes corruption of the Prometheus in-memory index when memory snapshotting is enabled and the TSDB crashes.
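For reference, here is a minimal sketch (not Thanos code) of the Prometheus TSDB option in question, assuming the prometheus/prometheus Go module at roughly the version mentioned above (v0.40.x); the data-directory path is a placeholder:

```go
package main

import (
	"log"

	kitlog "github.com/go-kit/log"
	"github.com/prometheus/prometheus/tsdb"
)

func main() {
	opts := tsdb.DefaultOptions()
	// With this option on, the head block is persisted as a chunk snapshot
	// on clean shutdown and restored from that snapshot (plus a shorter WAL
	// replay) on startup. The suspicion in this thread is that a crash while
	// this is enabled can leave the restored in-memory index inconsistent.
	opts.EnableMemorySnapshotOnShutdown = true

	// Placeholder path: Thanos Receive keeps one TSDB per tenant under its
	// --tsdb.path directory.
	db, err := tsdb.Open("/path/to/receive/tsdb/<tenant>", kitlog.NewNopLogger(), nil, opts, tsdb.NewDBStats())
	if err != nil {
		log.Fatalf("open tsdb: %v", err)
	}
	defer db.Close()

	log.Printf("head series after restore: %d", db.Head().NumSeries())
}
```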
Thanks for coming back with more information. The known panics should already be fixed in #6203 and #6271. The first one is already in 0.31, so updating might help you avoid similar issues in the future. We also updated the Prometheus version in the latest release, so the snapshotting issue might also be fixed.
Hey Filip,
Thanks for your quick response! I’ll give upgrading a shot and let you know what we see.
Best, Anugrah
Hey folks - after upgrading, the receive ingesters have been significantly more stable, though there was still one panic that caused a crash:
Looks like #6271, but I don't think this is released in 0.31.
I think you might be right.
Thanos, Prometheus and Golang version used:
Thanos: 0.30.2 (using go1.19.5)
Object Storage Provider:
Minio
What happened:
Last weekend we had an incident with our Thanos Receive cluster that led to incorrect query results during and after the incident. We deploy Thanos Receive in the dual router/ingester mode and experienced instability on the ingesters, causing high error rates on remote write requests. We're still investigating what caused the instability (the number of scheduled/running goroutines went above 100k and sustained at that level), but the effect we noticed is that query results for a particular metric included other metrics altogether. For example:
I ran the query below against different queriers across our topology and narrowed the issue down specifically to the ingesters:
Once the ingester stabilized, this behavior persisted, i.e. querying for a particular metric would return other metrics in the result. This behavior was only observed on one tenant, only occurred for the duration of metrics served by the ingesters (4h), and was not present in the actual TSDB blocks generated and uploaded to S3. Ultimately we had to prune the affected tenant (so a fresh TSDB instance would be created) in order to mitigate the issue.
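To make the symptom concrete, here is an illustrative sketch (not Thanos code, and not something we ran verbatim) that opens a per-tenant TSDB directory read-only and checks whether a __name__ selector returns any series with a different name; the path and metric name are placeholders, and the API follows the prometheus/prometheus Go module at roughly v0.40.x:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math"

	kitlog "github.com/go-kit/log"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/tsdb"
)

func main() {
	// Open the data directory read-only so the running ingester is not disturbed.
	db, err := tsdb.OpenDBReadOnly("/path/to/receive/tsdb/<tenant>", kitlog.NewNopLogger())
	if err != nil {
		log.Fatalf("open tsdb read-only: %v", err)
	}
	defer db.Close()

	q, err := db.Querier(context.Background(), math.MinInt64, math.MaxInt64)
	if err != nil {
		log.Fatalf("querier: %v", err)
	}
	defer q.Close()

	const name = "my_metric" // placeholder for the affected metric
	set := q.Select(false, nil, labels.MustNewMatcher(labels.MatchEqual, labels.MetricName, name))
	for set.Next() {
		lset := set.At().Labels()
		if lset.Get(labels.MetricName) != name {
			// This should never happen; during the incident it did for the in-memory data.
			fmt.Printf("selector %q returned foreign series: %s\n", name, lset)
		}
	}
	if err := set.Err(); err != nil {
		log.Fatalf("iterate series: %v", err)
	}
}
```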
What you expected to happen:
Metrics shouldn't have become intermingled, since that breaks PromQL guarantees. (We were using the standard Prometheus PromQL engine.)
How to reproduce it (as minimally and precisely as possible):
This is a tough one. The short answer is we don't know. We're still investigating the conditions that caused this behavior, and I'll be sure to share more details as they arise. I still felt it was important to share this observation since, in the six years I've been working in this field, I have never seen anything quite like this.