Remove service definition from agent after TTL #679
Comments
You need to remove the service from the nodes that no longer have it, probably just before the service is stopped on that node.
@blalor unfortunately that's not always possible. For example, I'd register the service from the Mesos executor process. This process could get OOM killed (since it runs inside a control group), and then I'd be unable to clean up the service definition. What we're doing now is using etcd to add a key for the service with a TTL, and constantly refreshing the TTL. If the TTL expires, the key is deleted. No other process needs to run to clean up stale etcd keys, which makes operations much simpler. It'd be interesting to know whether something like this is possible, or whether we need to write cron jobs that can somehow reconcile stale service definitions with services that should actually exist. The latter seems really lame.
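To make the refresh-or-expire behavior described above concrete, here is a minimal in-memory sketch. It does not talk to etcd or Consul; `TTLRegistry`, `put`, and `alive` are hypothetical names invented for illustration, and the clock is injectable only so expiry can be exercised without sleeping.

```python
import time


class TTLRegistry:
    """In-memory stand-in for a key store whose entries expire unless refreshed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._expires_at = {}    # key -> absolute expiry time

    def put(self, key, ttl):
        """Create or refresh a key; it disappears ttl seconds from now."""
        self._expires_at[key] = self._clock() + ttl

    def alive(self):
        """Return the keys whose TTL has not expired, dropping expired ones."""
        now = self._clock()
        self._expires_at = {k: t for k, t in self._expires_at.items() if t > now}
        return sorted(self._expires_at)
```

A process that keeps calling `put` stays listed. One that is OOM killed simply stops refreshing, and its key drops out on the next `alive()` call without any separate cleanup job.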
So you're registering the service with the agent as part of the lifecycle of your application? It seems like many folks are running some kind of sidekick process that handles the (de-)registration external to the actual application. etcd's TTLs do sound interesting, however.
@ajf Sounds like what may work is to have a TTL-based check, but to specify new behavior such that if it fails the entire service should be de-registered. Does that seem reasonable?
I actually have a very similar setup, also using Mesos and a TTL-based check for Consul, so when the service gets OOM killed it leaves an old registration entry that then needs to be cleaned up by some cron job. Since I plan to use HTTP checks instead of TTL in the future, I think it would be best if the option for automatic service deregistration worked with all types of checks supported by Consul.
@armon exactly right.
I am not sure where this issue ended up, but I feel like the title is about what I would like to see. If you are using a sidekick or something like registrator to register the service definition it seems like we could use the same TTL behavior as the keys. This way if things crash and move to another node (through orchestration), they will be unhealthy for a little while until the service definition disappears but then eventually fix themselves up. |
This would be great indeed when using Mesos or any other scheduler. How are you guys solving this issue now? By using a cronjob? |
Is it still planned, @armon, by any chance?
@scalp42 Yes, it is still planned |
@ajf and @armon We could really use this as well. Perhaps a new boolean attribute called deregisterService. The default value for deregisterService would be false, for backward compatibility. If deregisterService is true, then once the TTL is breached the service is deregistered from the catalog.
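A rough sketch of the semantics proposed above, assuming a flag that defaults to false. This is not Consul's API: `Service`, `sweep`, and `deregister_service` (mirroring the proposed deregisterService) are hypothetical names used only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    ttl: float
    last_heartbeat: float
    deregister_service: bool = False  # hypothetical flag; False keeps the existing behavior


def sweep(services, now):
    """Return (remaining services, status by name) after evaluating TTL checks at `now`."""
    remaining, status = [], {}
    for svc in services:
        expired = now - svc.last_heartbeat > svc.ttl
        if expired and svc.deregister_service:
            continue  # drop the whole service definition
        remaining.append(svc)
        status[svc.name] = "critical" if expired else "passing"
    return remaining, status
```

With the flag left at its default, an expired service stays registered and is reported as critical, which matches what the thread describes today. With the flag set, the same expiry removes the service entirely.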
Agreed @brycechesternewman -- should be defined at the check and/or the service. The latter would allow for built in checks for agent availability to be leveraged by default. |
Not sure if this is what you're already thinking, @armon. If service/register supported the
If we're removing TTL checked services, can we also set a parameter for non-TTL monitored services to be removed once they have failed n checks? |
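One way the "removed once they have failed n checks" idea could look, as a sketch only. `FailureCounter`, `record`, and `max_failures` are hypothetical names, and counting consecutive failures (with any success resetting the streak) is an assumption about what "failed n checks" should mean.

```python
class FailureCounter:
    """Tracks consecutive failed checks and reports when a service should be removed."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self._failures = {}  # service id -> consecutive failure count

    def record(self, service_id, passed):
        """Record one check result; return True if the service should be deregistered."""
        if passed:
            self._failures[service_id] = 0  # any success resets the streak
            return False
        count = self._failures.get(service_id, 0) + 1
        self._failures[service_id] = count
        return count >= self.max_failures
```

Because the count is kept per service and is independent of the check type, the same rule would apply to HTTP or script checks as well as TTL checks.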
+1 - what is the status of this? |
Great work @slackpad, thanks. |
thanks! |
In an environment such as Apache Mesos it is common and expected that services move around between nodes when applications are redeployed. For instance, if I launch a Rails application with 3 instances, they may get launched on node1, node5, and node40, and when I upgrade and redeploy it they may get launched on node2, with two on node33 (with different ports).
Is there any good way of dealing with this?
What ends up happening (with TTL based checks) is that the checks on the old nodes will go critical and will stay that way, even though the service will not come back to that node for a while. Short of implementing a clean-up process (I can't ensure the clean up process runs at all), is there anything that can help me? Perhaps having a service delete itself from an agent when the TTL check expires or something?