Consul left with many failed nodes (former agents) over time in cloud environment #544
Comments
@owaaa The nodes should automatically reap after 72 hours (not yet configurable, but soon). Otherwise, the best route is to issue a graceful leave before destroying the nodes (consul leave), so that they are reaped immediately. They are kept around that long because, without a graceful leave, Consul cannot distinguish between a temporary failure, an agent crash, a network partition, etc.
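To make the graceful-leave advice concrete, here is a minimal teardown-hook sketch. The placement and surrounding script are hypothetical; only the consul leave command itself comes from the advice above.

```sh
#!/bin/sh
# Hypothetical instance-teardown hook: gracefully remove this agent
# from the cluster before the box is destroyed, so the node moves to
# the "left" state and is reaped immediately instead of lingering as
# "failed" for 72 hours.
set -e

# Tell the local agent to leave the cluster (graceful leave).
consul leave

# ...continue with whatever actually destroys the instance...
```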
I had this same issue; I also tried issuing a force-leave, but the failed nodes stuck around.
@c4milo Were the nodes dead when you did a "force-leave"? Do they show up in "consul members" at all?
They were dead, yes; they showed up as failed members.
@c4milo Hmm, force-leave should push them into the "left" state. Nodes are not reaped until they have been in the failed state for 72 hours or are in the left state. Did "force-leave" not cause them to go to "left"?
Nope.
Another interesting fact: the WAN pool didn't have the failed nodes.
@armon I just ran into this again, here is a video: https://asciinema.org/a/bd1apr97vc45f4syahet3dxig. Should I open a separate issue for this?
@c4milo Everything looks fine in that video; the nodes are in the left state. "force-leave" just moves a node from the "failed" to the "left" state. They are not removed from the members list for 24 or 72 hours.
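A sketch of the state transition described above, for reference. The node name web-01 and the addresses are made up, and the consul members column layout is abbreviated:

```sh
consul members
# Node    Address        Status  Type    ...
# web-01  10.0.0.5:8301  failed  client  ...

# Move the failed node into the "left" state.
consul force-leave web-01

consul members
# Node    Address        Status  Type    ...
# web-01  10.0.0.5:8301  left    client  ...
# The node still appears in the list until the reap interval elapses.
```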
I see. I'm going to need more sleep. Thanks @armon.
+1 for making the reap time configurable
+1 for making reap time configurable
Another +1 for making the reap time configurable
-1 if it makes the cluster unstable or makes people prone to more issues than usual.
+1 for making reap time configurable
+1 for making reap time configurable
+1 for making the reap time configurable, pls!
+1 for making the reap time configurable
+1 for configurable reap time
👍
👍
👍
+1 for making the reap time configurable
While implementing #1935, which makes this configurable, and reviewing it with @sean-, we realized that lowering this too much can be fairly dangerous for the case of Consul servers. If there's a partition that isolates a server and this is set low, the server could get kicked prematurely and would need to be re-joined or restarted in order to work again, whereas with the current default setting you'd have 72 hours before that's a problem. Are people mostly worried about clutter in the consul members output?
I think at the very least, if we make it configurable, we should set the minimum relatively high, ~8 hours or more.
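Assuming the option from #1935 lands as reconnect_timeout with the 8-hour floor discussed above, configuring it might look something like this. The option name, file path, and value are illustrative; verify against the released docs before relying on them:

```sh
# Drop a config fragment into the agent's config directory so failed
# nodes are reaped after 8 hours instead of the 72-hour default.
cat > /etc/consul.d/reap.json <<'EOF'
{
  "reconnect_timeout": "8h"
}
EOF
```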
@slackpad For me, the problem this causes is that the /v1/catalog/service/:service endpoint still shows services from failed nodes. It makes more sense to me that once a node has failed, its services get removed.
@lucaswxp usually clients use the https://www.consul.io/docs/agent/http/health.html#health_service endpoint to find healthy instances (there's a passing query parameter that filters out instances with failing health checks).
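For illustration, the difference between the two lookups. The service name web and the local agent address are made up:

```sh
# Catalog view: includes service entries registered on failed nodes.
curl http://127.0.0.1:8500/v1/catalog/service/web

# Health view: ?passing drops instances with any failing checks,
# including the node-level serf health check on a failed node.
curl http://127.0.0.1:8500/v1/health/service/web?passing
```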
@slackpad using the health API is not always convenient (the payloads are larger, and we are not interested in node/check data, only the service entries); we prefer the /v1/catalog/service/:service API. The problem is that it returns service entries on failed nodes. Is there a way to filter this response to only return services running on known active (not failed) nodes? What is the rationale behind returning service entries for nodes known to be in a failed state?
Agreed with @MitchFierro. If the /v1/catalog/service/:service endpoint could also take a passing=true parameter, that would be preferable in the interest of a smaller response payload.
I have alerts firing for nodes that were scaled in by an autoscaling policy. What's the official way to avoid having scaled-in services hanging around and alerting in Consul? Does force-leave fix this?
I'm experiencing a situation where, over time, as boxes with agents are rebuilt and not always gracefully deregistered, I'm left with dozens of failed nodes with 0 services. If you deregister a node via the UI or the API, it still eventually comes back. I have found that force-leave works, but I have to issue it manually per failed node. Having many failed nodes makes things messy, and I'd like to figure out how to keep this clean, since the failed state is not offering any value.
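A possible stopgap for the bulk cleanup, sketched under the assumption that the default consul members column order (name, address, status, ...) holds:

```sh
# Force-leave every node currently reported as failed, instead of
# issuing the command by hand one node at a time. The header row is
# skipped naturally because its third column reads "Status".
consul members | awk '$3 == "failed" { print $1 }' | while read -r node; do
  consul force-leave "$node"
done
```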