Distributor Warning: removing ingester failing healthcheck #3028
Comments
When an ingester shuts down gracefully, it removes itself from the ring and the issue you describe shouldn't happen. However, if the ingester pod is not shut down cleanly (eg. process crash, node failure, ...), the ingester will not be removed from the ring and you're expected to address it manually (if it happens on more than 1 ingester, you may have data loss). By "manually address it" I mean opening the
I understand this is not ideal from the operability point of view, and we may reconsider / rediscuss it. Getting back to your issue, I have the feeling that when the pod is evicted the SIGTERM is not sent to the Cortex process and no clean shutdown happens. Could you double check it, please?
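To illustrate the difference between a clean shutdown and a crash, here is a minimal Python sketch of a process that unregisters itself from a ring when it receives SIGTERM. It is not Cortex code: the in-memory `ring` dict, the ingester IDs, and the helper names `make_sigterm_handler` and `install_handler` are all hypothetical stand-ins. If the signal never reaches the process (or the process is killed outright), the handler never runs and the entry stays in the ring.

```python
import signal

# Hypothetical in-memory stand-in for the ring. In a real deployment the ring
# lives in a KV store (Consul, Etcd, memberlist), not in a local dict.
ring = {"ingester-0": "ACTIVE", "ingester-1": "ACTIVE"}


def make_sigterm_handler(ring, ingester_id):
    """Build a signal handler that removes `ingester_id` from `ring`."""

    def handler(signum, frame):
        # Clean shutdown path: unregister from the ring. A real process would
        # also flush its data and exit after this step.
        ring.pop(ingester_id, None)

    return handler


def install_handler(ring, ingester_id):
    """Register the handler for SIGTERM (must be called from the main thread)."""
    handler = make_sigterm_handler(ring, ingester_id)
    signal.signal(signal.SIGTERM, handler)
    return handler
```

Calling `install_handler(ring, "ingester-0")` at startup is enough for the sketch. If the pod is evicted without SIGTERM being delivered to the process, nothing removes `"ingester-0"` from `ring`, which matches the symptom described in this issue.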
@pracucci Thank you very much. So I decided to create a tiny service to monitor the
I also wonder whether it would be possible to add this small feature to Cortex itself; for example, Cortex may provide an API (eg.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
Hi @pracucci, what if I click "forget" and 1-2 seconds later the ingesters are there again?
If an ingester is running (healthy), it will keep adding itself to the ring whenever it can't find an entry for itself. This means that, if an ingester is running and you "forget" it, it will be automatically re-added a few seconds later. If the ingester is "unhealthy", it's expected to not be running (eg. process crashed, node unresponsive, ...) and a manual forget is required to remove that ingester from the ring.
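The behaviour described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: the `Ring` class, its methods, the ingester IDs, and the 30-second timeout are all hypothetical and are not the Cortex implementation. The point it shows is that "forget" only sticks for an ingester that has stopped heartbeating.

```python
from dataclasses import dataclass, field


@dataclass
class Ring:
    # Maps ingester ID -> time of its last heartbeat (in seconds).
    entries: dict = field(default_factory=dict)

    def heartbeat(self, ingester_id, now):
        # A running ingester writes its entry on every heartbeat, so it
        # re-adds itself if the entry has been removed in the meantime.
        self.entries[ingester_id] = now

    def forget(self, ingester_id):
        # Manual removal, like clicking "forget" on the ring page.
        self.entries.pop(ingester_id, None)

    def unhealthy(self, now, timeout):
        # Entries whose last heartbeat is older than `timeout` seconds.
        return sorted(i for i, ts in self.entries.items() if now - ts > timeout)


def demo():
    ring = Ring()
    ring.heartbeat("ingester-a", now=0)
    ring.heartbeat("ingester-b", now=0)

    # ingester-b crashes; only ingester-a keeps heartbeating.
    ring.heartbeat("ingester-a", now=60)
    stale = ring.unhealthy(now=61, timeout=30)

    # Forget both entries, then let the running ingester heartbeat again.
    ring.forget("ingester-a")
    ring.forget("ingester-b")
    ring.heartbeat("ingester-a", now=65)
    return stale, sorted(ring.entries)
```

In `demo()`, only the crashed `ingester-b` is reported as unhealthy, and after both entries are forgotten only the still-running `ingester-a` reappears.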
Thanks @pracucci for the update. As far as I can tell I had only those three
Where do you store the ring? Consul? Etcd? Memberlist?
Sorry, didn't mention that. Memberlist.
Thanks. I suspect there is some bug related to forgetting when using memberlist, but we haven't yet been able to pinpoint it :(
I think #3603 will fix this issue.
If you still see this, another setting that may help is to change
Here is the situation:
When an ingester pod is evicted, there will be a warning in the distributor:
I think this is because the ingester pod has already been evicted, so its IP address no longer exists. And if more than half of all ingesters are evicted, there will be an error in the distributor:
So what can I do to deal with this problem?
Thanks.
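For context on why losing more than half of the ingesters turns the warning into an error, here is a small sketch assuming quorum-style replication, where a write must be accepted by a majority of its replicas. The function names and the replication factor of 3 used in the comments are illustrative assumptions, not values taken from this issue.

```python
def max_tolerated_failures(replication_factor):
    """Replicas that may be unavailable while a majority is still reachable."""
    return replication_factor // 2


def write_succeeds(replication_factor, unhealthy_replicas):
    """True if the healthy replicas still form a majority (quorum)."""
    healthy = replication_factor - unhealthy_replicas
    quorum = replication_factor // 2 + 1
    return healthy >= quorum


# With a replication factor of 3, a write survives 1 unhealthy replica
# but fails once 2 of the 3 replicas are unhealthy.
one_down = write_succeeds(3, 1)
two_down = write_succeeds(3, 2)
```

Under this assumption, a single evicted ingester only produces warnings, while evicting a majority of the replicas for a series makes writes to it fail.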