
Consul left with many failed nodes (former agents) over time in cloud environment #544

Closed
owaaa opened this issue Dec 16, 2014 · 30 comments
Labels
type/bug Feature does not function as expected

Comments

@owaaa

owaaa commented Dec 16, 2014

I'm experiencing a situation where, over time, as boxes running agents are rebuilt and not always gracefully deregistered, I'm left with dozens of failed nodes with 0 services. If I deregister a node via the UI or the API, it still eventually comes back. I have found that force-leave works, but I have to issue it manually per failed node. Having many failed nodes makes things dirty, and I'd like to figure out how to keep this clean, as the failed state is not offering any value.

@armon
Member

armon commented Dec 17, 2014

@owaaa The nodes should automatically reap out after 72 hours (not yet configurable, but soon). Otherwise, the best route is to issue a graceful leave (consul leave) before destroying the nodes, so that they can be reaped immediately. They are kept around for that long because, without a graceful leave, Consul cannot distinguish between a temporary failure, an agent crash, a network partition, etc.
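The graceful leave can be wired into the node's shutdown path so it happens automatically when an instance is destroyed. A minimal sketch as a systemd unit fragment (the binary and config paths are assumptions, not from this thread):

```ini
# Hypothetical systemd unit fragment: ExecStop issues a graceful
# leave so the node is reaped immediately instead of lingering as
# "failed" for 72 hours.
[Service]
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d
ExecStop=/usr/local/bin/consul leave
```

In an autoscaling setup, the same `consul leave` call would belong in the instance's terminate lifecycle hook.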

@c4milo
Contributor

c4milo commented May 5, 2015

I had this same issue, I also tried issuing a consul force-leave but the nodes were still lingering.

@armon
Member

armon commented May 7, 2015

@c4milo Were the nodes dead when you did a "force-leave"? Do they show up in "consul members" at all?

@armon armon added the type/bug Feature does not function as expected label May 7, 2015
@c4milo
Contributor

c4milo commented May 7, 2015

Yes, they were dead; they showed up as failed members.

@armon
Member

armon commented May 7, 2015

@c4milo Hmm, force-leave should push them into the "left" state. Nodes are not reaped until they have been in the failed state for 72 hours or are in the left state. Did "force-leave" not cause them to go to "left"?
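Since force-leave has to be issued per failed node, the cleanup described above can be scripted. A sketch that picks failed members out of `consul members` output and builds the commands to run (the sample output is illustrative, not from the thread; in practice you would capture real output with `subprocess`):

```python
# Sketch: find failed members in `consul members` output and build a
# force-leave command for each. SAMPLE stands in for real output,
# e.g. subprocess.run(["consul", "members"], capture_output=True, text=True).

SAMPLE = """\
Node     Address           Status  Type    Build  Protocol  DC
node-a   10.0.0.1:8301     alive   server  0.5.2  2         dc1
node-b   10.0.0.2:8301     failed  client  0.5.2  2         dc1
node-c   10.0.0.3:8301     failed  client  0.5.2  2         dc1
"""

def failed_nodes(members_output):
    """Return node names whose Status column reads 'failed'."""
    nodes = []
    for line in members_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "failed":
            nodes.append(fields[0])
    return nodes

commands = [f"consul force-leave {n}" for n in failed_nodes(SAMPLE)]
print(commands)  # one force-leave per failed node
```

Per armon's explanation, running these moves each node from "failed" to "left"; they still remain visible in the members list until reaped.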

@c4milo
Contributor

c4milo commented May 7, 2015

Nope.

@c4milo
Contributor

c4milo commented May 7, 2015

Another interesting fact: the WAN pool didn't have the failed nodes.

@c4milo
Contributor

c4milo commented May 12, 2015

@armon I just ran into this again, here is a video: https://asciinema.org/a/bd1apr97vc45f4syahet3dxig. Should I open a separate issue for this?

@armon
Member

armon commented May 12, 2015

@c4milo Everything looks fine in that video; the nodes are in the left state. "force-leave" just moves a node from the "failed" state to the "left" state. They are not removed from the members list for 24 or 72 hours.

@c4milo
Contributor

c4milo commented May 12, 2015

I see. I'm going to need more sleep. Thanks @armon.

@gtmtech

gtmtech commented Jun 11, 2015

+1 for making the reap time configurable

@bryanwb

bryanwb commented Nov 11, 2015

+1 for making reap time configurable

@atrbgithub

Another +1 for making the reap time configurable

@c4milo
Contributor

c4milo commented Nov 20, 2015

-1 if it makes the cluster unstable or prone to causing people more issues than usual.

@goacid

goacid commented Nov 23, 2015

+1 for making reap time configurable

@s1l0uk

s1l0uk commented Jan 19, 2016

+1 for making reap time configurable

@developerinlondon

+1 for the reap time configurable pls!

@erkules

erkules commented Feb 18, 2016

+1 for the reap time configurable

@naydencho

+1 for configurable reap time

@dim

dim commented Mar 21, 2016

👍

@Poogles

Poogles commented Mar 21, 2016

👍

@nadirollo

👍

@ThomasGilbert

+1 for the reap time configurable

@slackpad
Contributor

While implementing #1935, which makes this configurable, and reviewing it with @sean-, we realized that lowering this too much can be fairly dangerous for Consul servers. If there's a partition that isolates a server and this is set low, the server could get kicked prematurely and would need to be re-joined or restarted in order to work again, whereas with the current default setting 72 hours would have to elapse before that's a problem.

Are people mostly worried about clutter in consul members, or is there some other problem people are hoping will be fixed by making this configurable?

@slackpad
Contributor

I think at the very least if we make it configurable we should set the minimum relatively high, ~8 hours or more.

@lucaswxp

@slackpad For me, the problem this causes is that the /v1/catalog/service/:service endpoint still shows services from failed nodes. It makes more sense to me that once a node is failed, its services get removed.

@slackpad
Contributor

@lucaswxp usually clients use the https://www.consul.io/docs/agent/http/health.html#health_service endpoint to find healthy instances (there's a ?passing parameter that will filter to only healthy ones).
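The server-side filter is `?passing` on the health endpoint, but the same result can be reached client-side by dropping entries with any non-passing check. A sketch against the shape of a /v1/health/service/:service response (the sample payload below is illustrative, not taken from a real cluster):

```python
# Sketch: client-side equivalent of the ?passing filter on
# /v1/health/service/<name>. Each entry carries Node, Service,
# and Checks; keep only entries whose checks all pass.

sample = [
    {"Node": {"Node": "node-a"},
     "Service": {"ID": "web-1", "Service": "web"},
     "Checks": [{"Status": "passing"}, {"Status": "passing"}]},
    {"Node": {"Node": "node-b"},
     "Service": {"ID": "web-2", "Service": "web"},
     "Checks": [{"Status": "critical"}]},
]

def passing_only(entries):
    """Keep entries whose checks are all in the 'passing' state."""
    return [e for e in entries
            if all(c["Status"] == "passing" for c in e["Checks"])]

healthy = passing_only(sample)
print([e["Service"]["ID"] for e in healthy])  # the failed node's service is dropped
```

Using `?passing` on the request avoids transferring the unhealthy entries at all, which is the point slackpad is making.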

@MitchFierro

MitchFierro commented May 1, 2018

@slackpad Using the health API is not always convenient (the payloads are larger, and we are not interested in node/check data, only the service entries), so we prefer the /v1/catalog/service/:service API. The problem is that it returns service entries on failed nodes. Is there a way to filter this response to only return services running on known active (not failed) nodes? What is the rationale behind returning service entries for nodes known to be in a failed state?

@nikgibbens

Agreed with @MitchFierro. If the /v1/catalog/service/:service endpoint could also take a passing=true parameter, that would be preferable in the interest of a smaller response payload.

@Dmitry1987

> While implementing #1935 which makes this configurable, and reviewing it with @sean- we realized that lowering this too much can be fairly dangerous for the case of Consul servers. If there's a partition that isolates a server and this is set low, the server could get kicked prematurely and would need to be re-joined or restarted in order to work again, whereas with the current default setting you'd have to have 72 hours elapse before that's a problem.
>
> Are people mostly worried about clutter in consul members, or is there some other problem people are hoping will be fixed by making this configurable?

I have alerts firing for nodes that were scaled in by an autoscaling policy... what's the official way to avoid having scaled-in services hanging around and alerting in Consul? Does force-leave fix this?
