Remove service definition from agent after TTL #679

Closed
ajf opened this issue Feb 8, 2015 · 17 comments
Labels
type/enhancement Proposed improvement or new feature

Comments

@ajf

ajf commented Feb 8, 2015

In an environment such as Apache Mesos it is common, and expected, that services move around between nodes as applications are redeployed. For instance, if I launch a Rails application with 3 instances, they may get launched on node1, node5, and node40, and when I upgrade and redeploy it they may get launched on node2, with two more on node33 (on different ports).

Is there any good way of dealing with this?

What ends up happening (with TTL based checks) is that the checks on the old nodes will go critical and will stay that way, even though the service will not come back to that node for a while. Short of implementing a clean-up process (I can't ensure the clean up process runs at all), is there anything that can help me? Perhaps having a service delete itself from an agent when the TTL check expires or something?

@blalor
Contributor

blalor commented Feb 8, 2015

You need to remove the service from the nodes that no longer have it, probably just before the service is stopped on that node.

@ajf
Author

ajf commented Feb 8, 2015

@blalor unfortunately that's not always possible. For example, I'd register the service from the Mesos executor process. That process could get OOM killed (since it runs inside a control group), and then I'd be unable to clean up the service definition.

What we're doing now is using etcd to add a key for the service with a TTL, and constantly refreshing the TTL. If the TTL expires the key is deleted. No other process needs to run to clean up stale etcd keys, which makes operations much simpler.

It'd be interesting if something like this is possible, or if we need to write cron jobs that can somehow reconcile stale service definitions with services that should actually exist? The latter seems really lame.
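
For reference, the cron-style reconciliation mentioned above could look roughly like this. It is a minimal sketch, assuming the local agent's HTTP API listens on http://127.0.0.1:8500 and that any service with a critical check counts as stale. `stale_service_ids` and `reconcile` are hypothetical helper names, not part of Consul.

```python
import json
import urllib.request
from urllib.parse import quote

AGENT = "http://127.0.0.1:8500"  # assumption: default local agent address


def stale_service_ids(checks):
    """Return sorted IDs of services that have at least one critical check.

    `checks` is the JSON object returned by GET /v1/agent/checks,
    a mapping of check ID to check details.
    """
    return sorted({
        c["ServiceID"]
        for c in checks.values()
        if c.get("Status") == "critical" and c.get("ServiceID")
    })


def reconcile(agent=AGENT):
    """Deregister every service on this agent that has a critical check."""
    with urllib.request.urlopen(f"{agent}/v1/agent/checks") as resp:
        checks = json.load(resp)
    for service_id in stale_service_ids(checks):
        req = urllib.request.Request(
            f"{agent}/v1/agent/service/deregister/{quote(service_id)}",
            method="PUT",
        )
        urllib.request.urlopen(req).close()
```

Note that this is blunt: it also removes services that are only temporarily unhealthy, and it still depends on the job actually running on every node, which is the operational burden described above.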

@blalor
Contributor

blalor commented Feb 8, 2015

So you're registering the service with the agent as part of the lifecycle of your application? It seems like many folks are running some kind of sidekick process that handles the (de-)registration external to the actual application. etcd's TTLs do sound interesting, however.

@armon
Member

armon commented Feb 17, 2015

@ajf Sounds like what may work is to have a TTL based check, but to specify new behavior such that if it fails the entire service should be de-registered. Does that seem reasonable?
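
The behavior proposed here can be modeled in a few lines. The following is only a toy in-memory sketch of the idea, not how Consul implements anything; the `Agent` class, its method names, and the `deregister_on_expiry` flag are all hypothetical.

```python
import time


class Agent:
    """Toy in-memory model of an agent's service table (not Consul code)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # service ID -> (ttl seconds, deadline, deregister-on-expiry flag)
        self._services = {}

    def register(self, service_id, ttl, deregister_on_expiry=False):
        self._services[service_id] = (ttl, self._clock() + ttl, deregister_on_expiry)

    def heartbeat(self, service_id):
        """Push the deadline out by one TTL, as a passing TTL update would."""
        ttl, _, flag = self._services[service_id]
        self._services[service_id] = (ttl, self._clock() + ttl, flag)

    def sweep(self):
        """Remove expired services that opted in; return the IDs removed."""
        now = self._clock()
        expired = [sid for sid, (_, deadline, flag) in self._services.items()
                   if flag and now >= deadline]
        for sid in expired:
            del self._services[sid]
        return expired

    def status(self, service_id):
        if service_id not in self._services:
            return "deregistered"
        _, deadline, _ = self._services[service_id]
        return "critical" if self._clock() >= deadline else "passing"
```

A service registered without the flag keeps today's behavior and simply stays critical after its TTL lapses, so the change would be opt-in.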

@armon armon closed this as completed Feb 17, 2015
@armon armon reopened this Feb 17, 2015
@armon armon added the type/enhancement Proposed improvement or new feature label Feb 17, 2015
@pawelchcki

I actually have a very similar setup, also using Mesos and a TTL-based check for Consul, so when the service gets OOM killed it leaves behind an old registration entry that then needs to be cleaned up by a cron job.

Since I plan to use HTTP checks instead of TTL in the future, I think it would be best if the option for automatic service deregistration worked with all check types supported by Consul.

@ajf
Author

ajf commented Feb 28, 2015

@armon exactly right.

@camerondavison
Contributor

I am not sure where this issue ended up, but I feel like the title is about what I would like to see.

If you are using a sidekick or something like registrator to register the service definition it seems like we could use the same TTL behavior as the keys. This way if things crash and move to another node (through orchestration), they will be unhealthy for a little while until the service definition disappears but then eventually fix themselves up.

@bastichelaar

This would be great indeed when using Mesos or any other scheduler. How are you guys solving this issue now? By using a cronjob?

@scalp42
Contributor

scalp42 commented Jul 8, 2015

Is this still planned, @armon, by any chance?

@armon
Member

armon commented Jul 10, 2015

@scalp42 Yes, it is still planned

@brycechesternewman

@ajf and @armon We can really use this as well.

Perhaps a new boolean attribute called deregisterService could be added. Its default value would be false for backward compatibility. If deregisterService is true and the TTL is breached, the service is deregistered from the catalog.
JSON example:
{
  "check": {
    "id": "web-app",
    "name": "Web App Status",
    "notes": "Web app does a curl internally every 10 seconds",
    "ttl": "30s",
    "deregisterService": true
  }
}

@kriss9

kriss9 commented Aug 4, 2015

Agreed @brycechesternewman -- should be defined at the check and/or the service. The latter would allow for built in checks for agent availability to be leveraged by default.

@cablehead

Not sure if this is what you're already thinking, @armon. If service/register supported the acquire parameter, the service could be deleted when its session TTL expires, the way ephemeral keys currently work.
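
For context, the existing ephemeral-key mechanism referred to here can be sketched as three HTTP requests against the agent. This is a hedged illustration, assuming the default agent address; the helper function names are hypothetical, and the functions only build requests rather than send them.

```python
import json
from urllib.parse import quote, urlencode

AGENT = "http://127.0.0.1:8500"  # assumption: default local agent address


def create_session_request(ttl="30s"):
    """Build the request for a session whose held keys are deleted on expiry."""
    body = json.dumps({"TTL": ttl, "Behavior": "delete"}).encode()
    return "PUT", f"{AGENT}/v1/session/create", body


def acquire_key_request(key, session_id, value=b""):
    """Build the request that writes `key` while holding `session_id`."""
    query = urlencode({"acquire": session_id})
    return "PUT", f"{AGENT}/v1/kv/{quote(key)}?{query}", value


def renew_session_request(session_id):
    """Build the request that must be repeated before the TTL runs out."""
    return "PUT", f"{AGENT}/v1/session/renew/{quote(session_id)}", b""
```

If the owner stops renewing the session, the session is invalidated and the key is deleted. The suggestion above is to let service registrations follow the same lifecycle.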

@nickwales
Contributor

If we're removing TTL checked services, can we also set a parameter for non-TTL monitored services to be removed once they have failed n checks?
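
One way to picture the "failed n checks" rule is a per-service counter of consecutive failures. This is a minimal sketch of that idea only; `FailureTracker` and `max_failures` are hypothetical names, not Consul options.

```python
from collections import defaultdict


class FailureTracker:
    """Counts consecutive failed checks per service (illustrative only)."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self._failures = defaultdict(int)

    def record(self, service_id, passed):
        """Record one check result; return True if the service should be removed."""
        if passed:
            # Any passing result resets the streak.
            self._failures.pop(service_id, None)
            return False
        self._failures[service_id] += 1
        return self._failures[service_id] >= self.max_failures
```

Counting consecutive failures rather than total failures keeps a service that flaps occasionally from being removed.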

@OferE

OferE commented Jul 3, 2016

+1 - what is the status of this?

@j0hnsmith

Great work @slackpad, thanks.

@OferE

OferE commented Aug 17, 2016

thanks!
