Compactor: HTTP socket is only opened after initial cleanup of compacted blocks is complete #3395
Comments
Yes, this is expected because we want to reduce the count of objects if possible. Imagine a scenario where Thanos Compactor goes into a crash loop and not even the old files get deleted. What do you propose? Why doesn't increasing …
Thanks for triaging. Sure, we could increase … The problem is that, effectively, the compactor performs some work, runs the initial cleanup (or so we hope), and only once that's done does it start serving metrics and healthchecks. This seems to be the wrong way around. Granted, our particular setup makes this issue more visible than it would otherwise be (we have several huge buckets that receive > 500 GB of metrics per day, so there's a lot of compaction and cleanup to do), but it's the behaviour itself that I think we should review. What do you think?
What we did, indeed, is add a cleanup job before the compactor starts. I think the bug here is that the probe should say 200 OK sooner (: cc @GiedriusS WDYT?
Help wanted. To me, it's just enabling the probe early on the Compactor, and it sounds valid.
Fixed in v0.17.2 (#3532).
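For context, here is a minimal Go sketch of the ordering the comments above argue for. This is not the actual Thanos code; only the /-/healthy and /-/ready paths and the default HTTP port 10902 come from the issue, everything else is illustrative. The idea is to open the HTTP socket and answer the liveness probe immediately, run the slow initial cleanup in the background, and only flip readiness to 200 once it has finished:

package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var ready atomic.Bool

	mux := http.NewServeMux()
	mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, r *http.Request) {
		// Liveness: OK as soon as the socket is open, so K8s stops killing the pod.
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/-/ready", func(w http.ResponseWriter, r *http.Request) {
		// Readiness: only 200 once the initial cleanup has completed.
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// Stand-in for the slow initial bucket scan & cleanup, run in the background
	// instead of blocking before ListenAndServe (the behaviour reported below).
	go func() {
		time.Sleep(90 * time.Second)
		ready.Store(true)
		log.Println("initial cleanup finished, marking ready")
	}()

	log.Fatal(http.ListenAndServe(":10902", mux))
}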
Thanos, Prometheus and Golang version used:
Thanos Docker image: thanosio/thanos:master-2020-10-22-57076a58
Object Storage Provider:
GCP
What happened:
We wanted to make use of the new compactor functionality introduced in #3115, so we upgraded compactor pods to version "master-2020-10-22-57076a58".
Thanos compactors run in K8s with liveness and readiness probes configured to probe /-/healthy and /-/ready respectively. After the upgrade we noticed pods getting killed regularly because they were failing liveness probes.
This also means that we aren't able to collect any metrics from the affected compactor pods until the HTTP socket is open.
Compactor configuration:
What you expected to happen:
Thanos compactors operate w/o issues :)
How to reproduce it (as minimally and precisely as possible):
Deploy Thanos compactor version master-2020-10-22-57076a58 or later.
Observe that the compactor doesn't open the HTTP socket right away by running netstat in the pod.
Full logs to relevant components:
This is especially notable with buckets that have a large number of incoming blocks - initial bucket scan & cleanup can take over a minute.
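As a quick alternative to running netstat, a tiny stand-alone Go check (assuming the compactor's default HTTP port 10902) can confirm whether the socket is accepting connections yet:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Try to connect to the compactor's HTTP port; failure means the socket
	// is not open yet (e.g. the initial cleanup is still running).
	conn, err := net.DialTimeout("tcp", "localhost:10902", 2*time.Second)
	if err != nil {
		fmt.Println("HTTP socket not open yet:", err)
		return
	}
	conn.Close()
	fmt.Println("HTTP socket is open")
}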