Brings up Health Monitor HTTP server faster #2537
Merged
PR Summary
This commit adds a guard around the '/unresponsive_agents' endpoint so that it will return a "not successful" HTTP status code (in this case, 503) if the initial "query all Deployments and their Instances" action run by the Director has not yet completed.
To support that, it pushes down the guts of the fetch_deployments function into the InstanceManager class.
This commit also moves the start of the Health Monitor HTTP server to nearly the top of the 'Monitor#run' function. This should get the '/healthz' endpoint started as quickly as possible so that we don't get terminated by monit just because querying the state of the Director-managed deployments takes longer than ten seconds.
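The reordering in 'Monitor#run' could be sketched roughly like this (class and method names are hypothetical, not the actual Bosh::Monitor::Runner source); the point is only that the HTTP server comes up before any slow Director queries run:

```ruby
# Rough sketch of the reordering described above (names invented):
# the HTTP server starts before the potentially slow Director query.
class MonitorSketch
  attr_reader :steps

  def initialize
    @steps = []
  end

  def start_http_server
    @steps << :http_server_up # '/healthz' is now answerable
  end

  def fetch_deployments
    @steps << :deployments_fetched # may take longer than ten seconds
  end

  def run
    start_http_server # moved to (nearly) the top of #run
    fetch_deployments # slow step now happens after '/healthz' is live
  end
end
```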
Why do this?
We've discovered that Monit gives a monitored service roughly 10 seconds to bring up its health-checking HTTP server before Monit declares the service dead and restarts it. This means that Health Monitors on underprovisioned VMs, and/or ones tracking large numbers of Deployments or Instances, may take longer than 10 seconds to come up, and may find themselves in an unending restart loop.
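For context, a Monit HTTP connection test has roughly the following shape. This is an illustrative stanza only, not the actual BOSH job's monit file; the paths and port are made up:

```
check process health_monitor
  with pidfile /var/vcap/sys/run/health_monitor/health_monitor.pid
  if failed host 127.0.0.1 port 6611 protocol http
     request '/healthz'
     then restart
```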
The guard around the '/unresponsive_agents' endpoint preserves the previous behavior: calls to /unresponsive_agents do not succeed until the initial query of Director-managed deployments completes.
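A minimal sketch of the kind of guard described above (class and method names are hypothetical, not the actual bosh-monitor code): '/unresponsive_agents' answers 503 until the initial deployment query has completed.

```ruby
# Hypothetical sketch of the 503 guard; names are invented for illustration.
class InstanceManagerSketch
  def initialize
    @deployments_loaded = false
  end

  def initial_deployment_sync_done?
    @deployments_loaded
  end

  def mark_deployments_loaded!
    @deployments_loaded = true
  end
end

# Returns a [status, body] pair, mimicking a minimal HTTP handler.
def handle_unresponsive_agents(instance_manager)
  unless instance_manager.initial_deployment_sync_done?
    return [503, 'initial deployment query still in progress']
  end

  [200, '{}'] # the real handler would serialize the unresponsive-agent list
end
```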
Things to note
The changes to notifying_plugins_spec.rb should be somewhat carefully reviewed. Moving the HTTP server start earlier causes a number of "Health monitor failed to connect to director" messages to be placed in the Bosh::Monitor::Plugins::Dummy plugin event queue. Given that we can now scan through all of the messages in the queue and eventually find the one we expect, it seems pretty clear that the test as written assumed there would only ever be a single message in the event queue.

I did not bother to find out why the "when health monitor fails to fetch deployments" test succeeds. Perhaps it succeeds by coincidence, given the nature of the new failure messages?

I was also unable to discover where the BOSH Director (or a stub of it) that this test contacts is brought up.
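The spec adjustment described above can be illustrated like this (event shapes are invented for illustration): instead of asserting on the single queued event, scan the whole queue, tolerating the new "failed to connect to director" alerts.

```ruby
# Hypothetical event queue after the HTTP server start was moved earlier:
events = [
  { kind: :alert, summary: 'Health monitor failed to connect to director' },
  { kind: :alert, summary: 'Health monitor failed to connect to director' },
  { kind: :alert, summary: 'the alert the test actually cares about' },
]

# The old-style assertion would have checked `events.first` only;
# scanning the queue survives the extra connection-failure alerts.
found = events.any? { |e| e[:summary] == 'the alert the test actually cares about' }
```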
What is this change about?
See above.
What tests have you run against this PR?
I have run most of the unit tests.
How should this change be described in bosh release notes?
Improves Health Monitor startup reliability by bringing up the Health Monitor HTTP server as early as possible.
Does this PR introduce a breaking change?
I do not believe so. The user-visible behavior change is as follows:

Prior to this change, attempts to access the Health Monitor's /unresponsive_agents endpoint would fail with "connection refused" until the first survey of the healthiness of all deployments had completed.

After this change, attempts to access the Health Monitor's /unresponsive_agents endpoint will return 503 until the first survey of the healthiness of all deployments has completed.

Both are unsuccessful responses, and both indicate server failure rather than something that could be corrected client-side, so I believe this is not a breaking change.
Tag your pair, your PM, and/or team!
@aramprice