
Brings up Health Monitor HTTP server faster #2537

Merged: 1 commit into main from health-monitor-faster-bringup on Jul 19, 2024

Conversation

klakin-pivotal (Contributor)

PR Summary

This commit adds a guard around the '/unresponsive_agents' endpoint so that it will return a "not successful" HTTP status code (in this case, 503) if the initial "query all Deployments and their Instances" action run by the Director has not yet completed.

To support that, it pushes down the guts of the fetch_deployments function into the InstanceManager class.

This commit also moves the start of the Health Monitor HTTP server to nearly the top of the 'Monitor#run' function. This should get the '/healthz' endpoint started as quickly as possible so that we don't get terminated by monit just because querying the state of the Director-managed deployments takes longer than ten seconds.
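The reordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual `Monitor#run` implementation; the class layout and method names (`fetch_deployments`, `start`) are assumptions for the sketch.

```ruby
# Hypothetical sketch of the reordering in 'Monitor#run': the HTTP server
# (and with it '/healthz') is started before the potentially slow initial
# deployment query. All class and method names here are illustrative.
class Monitor
  def initialize(http_server, director, log = [])
    @http_server = http_server
    @director = director
    @log = log
  end

  def run
    @http_server.start            # '/healthz' can answer monit right away
    @log << :http_server_started
    @director.fetch_deployments   # may take well over ten seconds
    @log << :deployments_fetched
  end
end
```

The point is purely ordering: the server start no longer waits behind the deployment query.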

Why do this?

We've discovered that Monit gives a monitored service ~10 seconds to bring up its health-checking HTTP server before Monit declares the service dead and restarts it. This means that Director Health Monitors that run on underprovisioned VMs, and/or that manage large numbers of Deployments and/or Instances, may take longer than 10 seconds to come up, and may find themselves in an unending restart loop.

The guard around the '/unresponsive_agents' endpoint is added to preserve the previous behavior: calls to /unresponsive_agents do not succeed until the initial query of Director-managed deployments completes.
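The guard can be sketched like this. The class and method names below are hypothetical (the real endpoint lives in the Health Monitor's HTTP server, and the readiness check sits behind the InstanceManager); only the 503-until-ready behavior reflects the change.

```ruby
require 'json'

# Minimal sketch of the 503 guard; class and method names are illustrative,
# not the actual Health Monitor API.
class UnresponsiveAgentsEndpoint
  def initialize(instance_manager)
    @instance_manager = instance_manager
  end

  # Returns a [status, body] pair, Rack-style.
  def call
    unless @instance_manager.director_initial_deployment_sync_done?
      # Server-side "not ready yet" rather than a connection refusal.
      return [503, 'initial deployment query has not completed']
    end
    [200, JSON.generate(@instance_manager.unresponsive_agents)]
  end
end
```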

Things to note

The changes to notifying_plugins_spec.rb should be reviewed somewhat carefully. Moving the HTTP server start causes a bunch of "Health monitor failed to connect to director" messages to be put in the Bosh::Monitor::Plugins::Dummy plugin event queue. Given that we can scan through all of the messages in the queue and eventually find the one we expect, it seems clear that the test as written assumed there would only ever be a single message in the event queue.
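The shape of the spec fix is roughly the following: instead of asserting that the expected event is the only one queued, scan the whole queue for it, skipping noise such as the "failed to connect to director" alerts. The helper name and event shape below are illustrative, not the spec's actual code.

```ruby
# Hypothetical helper mirroring the spec fix: find the expected event in a
# queue that may now also contain connection-failure noise.
def find_expected_event(queue, expected_kind)
  queue.find { |event| event[:kind] == expected_kind }
end
```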

I did not bother to find out why the "when health monitor fails to fetch deployments" test succeeds. Perhaps it succeeds by coincidence, given the nature of the new failure messages?

I was unable to discover where the Bosh Director (or stub of the same) that this thing contacts was being brought up.

What is this change about?

See above.

What tests have you run against this PR?

I have run most of the unit tests.

How should this change be described in bosh release notes?

Improves Health Monitor startup reliability by bringing up the Health Monitor HTTP server as fast as possible.

Does this PR introduce a breaking change?

I do not believe so. The user-visible behavior change is as follows:

Prior to this change, attempts to access the Health Monitor's /unresponsive_agents endpoint would fail with "connection refused" until the first survey of the healthiness of all deployments had completed.
After this change, attempts to access the Health Monitor's /unresponsive_agents endpoint will return 503 until the first survey of the healthiness of all deployments has completed.

Both are unsuccessful returns, and both indicate server failure, rather than something that could be corrected client-side, so I believe that this is not a breaking change.
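From a client's point of view, both failure modes collapse to "not ready yet", which is why the change is not breaking. A polling client might treat them identically, as in this sketch (the URL and helper name are illustrative):

```ruby
require 'net/http'

# Returns true only once /unresponsive_agents answers with a success code.
# A 503 (post-change) and a refused connection (pre-change) both read as
# "not ready yet" to the caller.
def unresponsive_agents_ready?(uri)
  res = Net::HTTP.get_response(uri)
  res.is_a?(Net::HTTPSuccess)
rescue Errno::ECONNREFUSED
  false
end
```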

Tag your pair, your PM, and/or team!

@aramprice

The commit message repeats the PR summary and "Why do this?" sections above, ending with:

[#187938284] [JIRA] BOSH-296 "Health monitor continuously exiting for foundation with large number of VMs"

Signed-off-by: Aram Price <aram.price@broadcom.com>
@klakin-pivotal klakin-pivotal requested review from a team, aramprice and nouseforaname and removed request for a team July 17, 2024 22:56
beyhan (Member) commented Jul 18, 2024

Issue #2524 could be related.

@klakin-pivotal klakin-pivotal merged commit 9c2fe7d into main Jul 19, 2024
4 checks passed
@klakin-pivotal klakin-pivotal removed the request for review from nouseforaname July 19, 2024 16:27
klakin-pivotal (Contributor, PR author)

Yeah, any situation where the Health Monitor is continually getting restarted every ten seconds, and not reporting some sort of failure in its logs is likely related to this.

@aramprice aramprice deleted the health-monitor-faster-bringup branch July 22, 2024 16:30
@klakin-pivotal klakin-pivotal mentioned this pull request Jul 22, 2024
klakin-pivotal added a commit that referenced this pull request Jul 22, 2024