Remove service definition from agent after TTL #679

ajf · 2015-02-08T06:19:12Z

In an environment such as Apache Mesos it is common, and expected, that services move around between nodes as applications are redeployed. For instance, if I launch a Rails application with 3 instances, they may get launched on node1, node5, and node40, and when I upgrade and redeploy it they may get launched with one instance on node2 and two on node33 (with different ports).

Is there any good way of dealing with this?

What ends up happening (with TTL-based checks) is that the checks on the old nodes go critical and stay that way, even though the service will not come back to that node for a while. Short of implementing a clean-up process (and I can't ensure the clean-up process runs at all), is there anything that can help me? Perhaps having a service delete itself from an agent when the TTL check expires, or something similar?

blalor · 2015-02-08T12:32:26Z

You need to remove the service from the nodes that no longer have it, probably just before the service is stopped on that node.

ajf · 2015-02-08T19:16:54Z

@blalor unfortunately that's not always possible. For example, I'd register the service from the Mesos executor process. That process could get OOM-killed (since it runs inside a control group), and then I'd be unable to clean up the service definition.

What we're doing now is using etcd to add a key for the service with a TTL, and constantly refreshing that TTL. If the TTL expires, the key is deleted. No other process needs to run to clean up stale etcd keys, which makes operations much simpler.

It'd be interesting to know whether something like this is possible in Consul, or whether we need to write cron jobs that somehow reconcile stale service definitions with the services that should actually exist. The latter seems really lame.
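The etcd pattern described above (an entry written with a TTL and refreshed by the live process, which simply disappears once the refreshes stop) can be illustrated with a small in-memory sketch. This is not etcd's or Consul's API: `TTLRegistry`, `refresh`, and `live` are hypothetical names, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TTLRegistry:
    """Toy registry: an entry survives only while its owner keeps refreshing it."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._deadline = {}  # key -> time at which the entry expires

    def refresh(self, key, now):
        # Registering and heartbeating are the same operation: push the deadline out.
        self._deadline[key] = now + self.ttl_seconds

    def live(self, now):
        # Drop every entry whose deadline has passed, then return what is left.
        self._deadline = {k: d for k, d in self._deadline.items() if d > now}
        return sorted(self._deadline)


registry = TTLRegistry(ttl_seconds=30)
registry.refresh("rails@node1:31001", now=0)
registry.refresh("rails@node5:31002", now=0)
registry.refresh("rails@node5:31002", now=25)  # node5 instance keeps heartbeating
# The node1 executor was OOM-killed after t=0 and never refreshes again.
print(registry.live(now=40))  # ['rails@node5:31002']
```

No separate clean-up job appears anywhere in the sketch: the stale node1 entry goes away purely because nothing refreshed it.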

blalor · 2015-02-08T23:29:23Z

So you're registering the service with the agent as part of the lifecycle of your application? It seems like many folks are running some kind of sidekick process that handles the (de-)registration external to the actual application. etcd's TTLs do sound interesting, however.

armon · 2015-02-17T19:37:51Z

@ajf Sounds like what may work is to have a TTL-based check, but to specify new behavior such that if it fails, the entire service is de-registered. Does that seem reasonable?

pawelchcki · 2015-02-27T23:09:22Z

I actually have a very similar setup, also using Mesos and a TTL-based check for Consul, so when the service gets OOM-killed it leaves behind an old registration entry that then needs to be cleaned up by a cron job.

Since I plan to use HTTP checks instead of TTL in the future, I think it would be best if the option for automatic service deregistration worked with all check types supported by Consul.

ajf · 2015-02-28T16:37:23Z

@armon exactly right.

camerondavison · 2015-05-19T01:42:43Z

I am not sure where this issue ended up, but I feel like the title is about what I would like to see.

If you are using a sidekick or something like registrator to register the service definition, it seems like we could use the same TTL behavior as the keys. This way, if things crash and move to another node (through orchestration), they will be unhealthy for a little while until the service definition disappears, but will then eventually fix themselves up.

bastichelaar · 2015-06-11T12:50:02Z

This would be great indeed when using Mesos or any other scheduler. How are you guys solving this issue now? By using a cronjob?

scalp42 · 2015-07-08T23:49:58Z

Is this still planned, @armon, by any chance?

armon · 2015-07-10T00:20:09Z

@scalp42 Yes, it is still planned

brycechesternewman · 2015-08-01T19:31:27Z

@ajf and @armon We can really use this as well.

Perhaps add a new boolean attribute called deregisterService. Its default value would be false, for backward compatibility. If deregisterService is true and the TTL is breached, the service is deregistered from the catalog.

JSON example:

    {
      "check": {
        "id": "web-app",
        "name": "Web App Status",
        "notes": "Web app does a curl internally every 10 seconds",
        "ttl": "30s",
        "deregisterService": true
      }
    }

kriss9 · 2015-08-04T01:01:47Z

Agreed @brycechesternewman -- this should be definable at the check and/or the service. The latter would allow the built-in checks for agent availability to be leveraged by default.

cablehead · 2015-08-06T20:12:27Z

Not sure if this is what you're already thinking, @armon. If service/register supported the acquire parameter, the service could be deleted when its session TTL expires, the way ephemeral keys currently work.

nickwales · 2015-10-19T20:05:28Z

If we're removing TTL checked services, can we also set a parameter for non-TTL monitored services to be removed once they have failed n checks?
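The "removed once they have failed n checks" idea could look roughly like the sketch below. Everything in it is hypothetical (`FailureReaper`, `max_failures`, `record`); it only illustrates the counting logic for any check type and is not something Consul implements in this form.

```python
class FailureReaper:
    """Toy model: deregister a service after N consecutive failed checks."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self._failures = {}  # service ID -> consecutive failed checks

    def register(self, service_id):
        self._failures[service_id] = 0

    def record(self, service_id, passed):
        """Record one check result; return True if the service was deregistered."""
        if service_id not in self._failures:
            return False  # unknown or already deregistered
        if passed:
            self._failures[service_id] = 0  # a passing check resets the streak
            return False
        self._failures[service_id] += 1
        if self._failures[service_id] >= self.max_failures:
            del self._failures[service_id]
            return True
        return False

    def services(self):
        return sorted(self._failures)
```

Resetting the counter on a passing check means a flapping service is kept, while one that has genuinely moved to another node is eventually dropped.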

OferE · 2016-07-03T09:00:42Z

+1 - what is the status of this?

j0hnsmith · 2016-08-17T08:05:08Z

Great work @slackpad, thanks.

OferE · 2016-08-17T10:42:03Z

thanks!

armon closed this as completed Feb 17, 2015

armon reopened this Feb 17, 2015

armon added the type/enhancement (Proposed improvement or new feature) label Feb 17, 2015

stevendborrelli mentioned this issue Aug 8, 2015

mesos-consul doesn't de-register inactive services after restart mantl/mesos-consul#15

Closed

dankraw mentioned this issue Nov 19, 2015

Service deregistration after healthcheck failure #679 #1432

Closed

Unix4ever mentioned this issue Nov 19, 2015

Added unhealthy_timeout parameter to the healthcheck definition. #1433

Closed

j0hnsmith mentioned this issue Dec 8, 2015

Cleanup services when TTL expired autopilotpattern/consul#6

Closed

slackpad mentioned this issue Aug 16, 2016

Adds ability to deregister a service based on critical check state longer than a timeout. #2276

Merged

slackpad closed this as completed in #2276 Aug 16, 2016
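For readers arriving from search: as far as the Consul documentation describes it, the change merged in #2276 is exposed as the `DeregisterCriticalServiceAfter` field on a check definition in the agent HTTP API (`deregister_critical_service_after` in agent config files). The sketch below only builds the JSON body for a service registration with a TTL check. The service ID, name, port, and durations are made-up values, the helper `registration_body` is hypothetical, and the field names should be verified against the documentation for your Consul version.

```python
import json


def registration_body(service_id, name, port, ttl, deregister_after):
    """Build a registration body whose TTL check reaps the service if it stays critical."""
    return {
        "ID": service_id,
        "Name": name,
        "Port": port,
        "Check": {
            "TTL": ttl,
            # Assumed field name; the service is removed once the check has
            # been critical for longer than this duration.
            "DeregisterCriticalServiceAfter": deregister_after,
        },
    }


body = registration_body("rails-node5-31002", "rails", 31002, "30s", "5m")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the local agent with `PUT /v1/agent/service/register`. Removal is not instantaneous: the timeout is measured from when the check went critical, and the agent reaps such services periodically rather than at the exact moment the timeout elapses.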
