
Consul left with many failed nodes (former agents) over time in cloud environment #544

Closed
owaaa opened this issue Dec 16, 2014 · 30 comments
Labels
type/bug Feature does not function as expected

Comments

@owaaa

owaaa commented Dec 16, 2014

I'm experiencing a situation where, over time, as boxes running agents are rebuilt and not always gracefully deregistered, I'm left with dozens of failed nodes with 0 services. If I deregister a node via the UI or the API, it still eventually comes back. I have found that force-leave works, but I have to issue it manually per failed node. Having many failed nodes makes things dirty, and I'd like to figure out how to keep this clean, as the failed state is not offering any value.

@armon
Member

armon commented Dec 17, 2014

@owaaa The nodes should automatically reap out after 72 hours (not yet configurable, but soon). Otherwise, the best route is to issue a graceful leave (consul leave) before destroying the nodes, so that they can be reaped immediately. They are kept around for that long because, without a graceful leave, Consul cannot distinguish between a temporary failure, an agent crash, a network partition, etc.
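The graceful leave can be wired into the node's shutdown path so it happens automatically when an instance is destroyed. A minimal sketch as a systemd unit fragment (the binary and config paths are assumptions, not from this thread):

```ini
# Hypothetical systemd unit fragment: ExecStop issues a graceful
# leave so the node is reaped immediately instead of lingering as
# "failed" for 72 hours.
[Service]
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d
ExecStop=/usr/local/bin/consul leave
```

In an autoscaling setup, the same `consul leave` call would belong in the instance's terminate lifecycle hook.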

@c4milo
Contributor

c4milo commented May 5, 2015

I had this same issue, I also tried issuing a consul force-leave but the nodes were still lingering.

@armon
Member

armon commented May 7, 2015

@c4milo Were the nodes dead when you did a "force-leave"? Do they show up in "consul members" at all?

@armon armon added the type/bug Feature does not function as expected label May 7, 2015
@c4milo
Contributor

c4milo commented May 7, 2015

Yes, they were dead; they showed up as failed members.

@armon
Member

armon commented May 7, 2015

@c4milo Hmm, force-leave should push them into the "left" state. Nodes are not reaped until they have been in the failed state for 72 hours or are in the left state. Did "force-leave" not cause them to go to "left"?
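Since force-leave has to be issued per failed node, the cleanup described above can be scripted. A sketch that picks failed members out of `consul members` output and builds the commands to run (the sample output is illustrative, not from the thread; in practice you would capture real output with `subprocess`):

```python
# Sketch: find failed members in `consul members` output and build a
# force-leave command for each. SAMPLE stands in for real output,
# e.g. subprocess.run(["consul", "members"], capture_output=True, text=True).

SAMPLE = """\
Node     Address           Status  Type    Build  Protocol  DC
node-a   10.0.0.1:8301     alive   server  0.5.2  2         dc1
node-b   10.0.0.2:8301     failed  client  0.5.2  2         dc1
node-c   10.0.0.3:8301     failed  client  0.5.2  2         dc1
"""

def failed_nodes(members_output):
    """Return node names whose Status column reads 'failed'."""
    nodes = []
    for line in members_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "failed":
            nodes.append(fields[0])
    return nodes

commands = [f"consul force-leave {n}" for n in failed_nodes(SAMPLE)]
print(commands)  # one force-leave per failed node
```

Per armon's explanation, running these moves each node from "failed" to "left"; they still remain visible in the members list until reaped.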

@c4milo
Contributor

c4milo commented May 7, 2015

Nope.

@c4milo
Contributor

c4milo commented May 7, 2015

Another interesting fact: the WAN pool didn't have the failed nodes.

@c4milo
Contributor

c4milo commented May 12, 2015

@armon I just ran into this again, here is a video: https://asciinema.org/a/bd1apr97vc45f4syahet3dxig. Should I open a separate issue for this?

@armon
Member

armon commented May 12, 2015

@c4milo Everything looks fine in that video; the nodes are in the left state. "force-leave" just moves a node from the "failed" state to the "left" state. They are not removed from the members list for 24 or 72 hours.

@c4milo
Contributor

c4milo commented May 12, 2015

I see. I'm going to need more sleep. Thanks @armon.

@gtmtech

gtmtech commented Jun 11, 2015

+1 for making the reap time configurable

@bryanwb

bryanwb commented Nov 11, 2015

+1 for making reap time configurable

@atrbgithub

Another +1 for making the reap time configurable

@c4milo
Contributor

c4milo commented Nov 20, 2015

-1 if it makes the cluster unstable or prone to causing people more issues than usual.

@goacid

goacid commented Nov 23, 2015

+1 for making reap time configurable

@s1l0uk

s1l0uk commented Jan 19, 2016

+1 for making reap time configurable

@developerinlondon

+1 for the reap time configurable pls!

@erkules

erkules commented Feb 18, 2016

+1 for the reap time configurable

@naydencho

+1 for configurable reap time

@dim

dim commented Mar 21, 2016

👍

@Poogles

Poogles commented Mar 21, 2016

👍

@nadirollo

👍

@ThomasGilbert

+1 for the reap time configurable

@slackpad
Contributor

While implementing #1935, which makes this configurable, and reviewing it with @sean-, we realized that lowering this too much can be fairly dangerous for Consul servers. If there's a partition that isolates a server and this is set low, the server could get kicked prematurely and would need to be re-joined or restarted in order to work again, whereas with the current default setting 72 hours would have to elapse before that's a problem.

Are people mostly worried about clutter in consul members, or is there some other problem people are hoping will be fixed by making this configurable?

@slackpad
Contributor

I think at the very least if we make it configurable we should set the minimum relatively high, ~8 hours or more.

@lucaswxp

@slackpad For me, the problem this causes is that the /v1/catalog/service/:service endpoint still shows services from failed nodes. It makes more sense to me that once a node is failed, its services get removed.

@slackpad
Contributor

@lucaswxp usually clients use the https://www.consul.io/docs/agent/http/health.html#health_service endpoint to find healthy instances (there's a ?passing parameter that will filter to only healthy ones).
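The server-side filter is `?passing` on the health endpoint, but the same result can be reached client-side by dropping entries with any non-passing check. A sketch against the shape of a /v1/health/service/:service response (the sample payload below is illustrative, not taken from a real cluster):

```python
# Sketch: client-side equivalent of the ?passing filter on
# /v1/health/service/<name>. Each entry carries Node, Service,
# and Checks; keep only entries whose checks all pass.

sample = [
    {"Node": {"Node": "node-a"},
     "Service": {"ID": "web-1", "Service": "web"},
     "Checks": [{"Status": "passing"}, {"Status": "passing"}]},
    {"Node": {"Node": "node-b"},
     "Service": {"ID": "web-2", "Service": "web"},
     "Checks": [{"Status": "critical"}]},
]

def passing_only(entries):
    """Keep entries whose checks are all in the 'passing' state."""
    return [e for e in entries
            if all(c["Status"] == "passing" for c in e["Checks"])]

healthy = passing_only(sample)
print([e["Service"]["ID"] for e in healthy])  # the failed node's service is dropped
```

Using `?passing` on the request avoids transferring the unhealthy entries at all, which is the point slackpad is making.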

@MitchFierro

MitchFierro commented May 1, 2018

@slackpad Using the health API is not always convenient (the payloads are larger, and we are not interested in node/check data, only the service entries), so we prefer the /v1/catalog/service/:service API. The problem is that it returns service entries on failed nodes. Is there a way to filter this response to only return services running on known active (not failed) nodes? What is the rationale behind returning service entries for nodes known to be in a failed state?

@nikgibbens

Agreed with @MitchFierro. If the /v1/catalog/service/:service endpoint could also take a passing=true parameter, that would be preferable in the interest of a smaller response payload.

@Dmitry1987

> While implementing #1935 which makes this configurable, and reviewing it with @sean- we realized that lowering this too much can be fairly dangerous for the case of Consul servers. If there's a partition that isolates a server and this is set low, the server could get kicked prematurely and would need to be re-joined or restarted in order to work again, whereas with the current default setting you'd have to have 72 hours elapse before that's a problem.
>
> Are people mostly worried about clutter in consul members, or is there some other problem people are hoping will be fixed by making this configurable?

I have alerts firing for nodes that were scaled in by an autoscaling policy... what's the official way to avoid having scaled-in services hanging around and alerting in Consul? Does force-leave fix this?
