Replies: 4 comments
-
Interesting question. While I'm not on the leading edge of distributed cloud-native app development, I've not seen that in the wild (apps that run their own healthchecks without outside probing). I would think it's harder to build a check that runs on an interval inside the app and kills the app when it detects a problem than to let an outside probe do that job. A few additional dilemmas come up in that scenario. These thoughts are in no particular order, more random ideas than anything:
I hope all that helps :)
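(For concreteness, here's what the in-app self-check pattern described above could look like — a minimal Go sketch, not a recommendation. `checkDependencies`, the 30-second interval, and the three-failure threshold are all illustrative placeholders; the idea is that the app exits and lets the orchestrator restart it.)

```go
package main

import (
	"errors"
	"log"
	"os"
	"time"
)

// checkDependencies is a stand-in for whatever the app considers "healthy"
// (DB ping, queue connection, disk space, etc.).
func checkDependencies() error {
	// ... real checks would go here ...
	return errors.New("example: dependency unreachable")
}

func main() {
	go func() {
		failures := 0
		// Run the self-check on an interval inside the app itself.
		for range time.Tick(30 * time.Second) {
			if err := checkDependencies(); err != nil {
				failures++
				log.Printf("self-check failed (%d in a row): %v", failures, err)
				if failures >= 3 {
					// Self-terminate; the orchestrator's restart policy takes over.
					os.Exit(1)
				}
			} else {
				failures = 0
			}
		}
	}()

	// ... the app's real work would run here ...
	select {} // block forever for the sake of the sketch
}
```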
-
@BretFisher Thanks so much for the thorough response. This really clears things up for me!
-
Hi @BretFisher, thanks for the feedback to @geekdave - he and I are the ones debating this issue. I'd like your opinion on a more concrete example: we are thinking of deploying git-sync (e.g. https://hub.docker.com/r/openweb/git-sync/), where there are no ports to probe. For monitoring, I was thinking it would be sufficient to periodically check whether the container is still running (assuming the underlying code will exit when there is an error such as an invalid login), rather than writing a sidecar process that listens on a TCP port and serves health info (the latter seems like extra overhead). But I am more than willing to be convinced otherwise. FYI, we are very early in our Docker migration and are currently managing containers via …
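(As a point of reference for the "just check whether it's still running" option: a minimal Go sketch that shells out to `docker inspect`. The container name `git-sync` and the one-minute interval are assumptions for illustration.)

```go
package main

import (
	"log"
	"os/exec"
	"strings"
	"time"
)

// isRunning asks the Docker daemon whether the named container is running.
func isRunning(name string) bool {
	out, err := exec.Command("docker", "inspect",
		"--format", "{{.State.Running}}", name).Output()
	if err != nil {
		return false // a missing container counts as not running
	}
	return strings.TrimSpace(string(out)) == "true"
}

func main() {
	for range time.Tick(time.Minute) {
		if !isRunning("git-sync") {
			// Hook real alerting (email, Slack, ...) in here.
			log.Println("git-sync container is not running; alerting...")
		}
	}
}
```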
-
Thanks for being a fan! Kube is on the roadmap of upcoming courses for sure; expect to see announcements on my plans in the coming weeks. I don't know about this git-sync thing, but it smells of anti-patterns. Hopefully it's not syncing code, as that goes against the idea of building a Docker image from a commit ID, so the image is a direct artifact of the code that can be deployed and guaranteed to match that commit... Anyway, "checking if the container is running" is exactly what Swarm and Kubernetes do, so I wouldn't bother with making your own tool. What you'd need is a monitoring system that tracks orchestrator events and alerts you about the things you care about. The orchestrator's job is to ensure your service is available and to manage containers and other objects to meet your declarative service definitions. That'll make more sense once you're through the Swarm sections of the course.
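(To make "track orchestrator events and alert" concrete: a minimal Go sketch that tails the Docker event stream for container `die` events via `docker events`. Swarm and Kubernetes expose richer event APIs; this only illustrates the idea, and the alerting hook is a placeholder.)

```go
package main

import (
	"bufio"
	"log"
	"os/exec"
)

func main() {
	// Stream container exit events from the daemon as JSON, one per line.
	cmd := exec.Command("docker", "events",
		"--filter", "event=die",
		"--format", "{{json .}}")
	out, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		// Replace this log line with a real alert (Slack, PagerDuty, ...).
		log.Println("container died:", scanner.Text())
	}
}
```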
-
Following the philosophy of "a container is just a process", my team got to discussing how a running container should diagnose its own health, and what it should do once it's decided that it's unhealthy.
Going with the convention of a `/healthz` endpoint, when a container is asked about its health, it should respond with a `2xx` if it's healthy, and with something else if it's not.
Our main question is:
If a process in a container realizes that it's not healthy, should the container just sit there unhealthy waiting for an orchestrator to terminate it, or should it "self-terminate" by exiting its own process, and causing the docker daemon to respawn it?
If the answer is that it should self-terminate, then is the sole purpose of a health check to guard against the case where a process in a container is fully "hung" and unable to respond to health checks at all? If a container returned a 500 for a health check, should it instead have just taken the liberty of terminating itself?
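(For reference, the `/healthz` convention described above might look like this minimal Go sketch: `200` when healthy, `503` otherwise. The `healthy` flag, port `8080`, and the absence of real checks are illustrative assumptions.)

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// healthy is flipped to false when the app decides it's unhealthy.
var healthy atomic.Bool

func main() {
	healthy.Store(true)

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if healthy.Load() {
			w.WriteHeader(http.StatusOK) // 2xx: healthy
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable) // non-2xx: unhealthy
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```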