
Brings up Health Monitor HTTP server faster #2537

Merged: 1 commit into main from health-monitor-faster-bringup on Jul 19, 2024

Conversation

klakin-pivotal (Contributor)

PR Summary

This commit adds a guard around the '/unresponsive_agents' endpoint so that it will return a "not successful" HTTP status code (in this case, 503) if the initial "query all Deployments and their Instances" action run by the Director has not yet completed.

To support that, it pushes down the guts of the fetch_deployments function into the InstanceManager class.

This commit also moves the start of the Health Monitor HTTP server to nearly the top of the 'Monitor#run' function. This should get the '/healthz' endpoint started as quickly as possible so that we don't get terminated by monit just because querying the state of the Director-managed deployments takes longer than ten seconds.
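The reordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual `Monitor#run` implementation; the class layout and method names (`fetch_deployments`, `start`) are assumptions for the sketch.

```ruby
# Hypothetical sketch of the reordering in 'Monitor#run': the HTTP server
# (and with it '/healthz') is started before the potentially slow initial
# deployment query. All class and method names here are illustrative.
class Monitor
  def initialize(http_server, director, log = [])
    @http_server = http_server
    @director = director
    @log = log
  end

  def run
    @http_server.start            # '/healthz' can answer monit right away
    @log << :http_server_started
    @director.fetch_deployments   # may take well over ten seconds
    @log << :deployments_fetched
  end
end
```

The point is purely ordering: the server start no longer waits behind the deployment query.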

Why do this?

We've discovered that Monit gives a monitored service ~10 seconds to bring up its health-checking HTTP server before Monit declares the service dead and restarts it. This means that Director Health Monitors that run on underprovisioned VMs, and/or that manage large numbers of Deployments and/or Instances, may take longer than 10 seconds to come up, and may find themselves in an unending restart loop.

The guard around the '/unresponsive_agents' endpoint is added to preserve the previous behavior: calls to /unresponsive_agents do not succeed until the initial query of Director-managed deployments completes.
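The guard can be sketched like this. The class and method names below are hypothetical (the real endpoint lives in the Health Monitor's HTTP server, and the readiness check sits behind the InstanceManager); only the 503-until-ready behavior reflects the change.

```ruby
require 'json'

# Minimal sketch of the 503 guard; class and method names are illustrative,
# not the actual Health Monitor API.
class UnresponsiveAgentsEndpoint
  def initialize(instance_manager)
    @instance_manager = instance_manager
  end

  # Returns a [status, body] pair, Rack-style.
  def call
    unless @instance_manager.director_initial_deployment_sync_done?
      # Server-side "not ready yet" rather than a connection refusal.
      return [503, 'initial deployment query has not completed']
    end
    [200, JSON.generate(@instance_manager.unresponsive_agents)]
  end
end
```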

Things to note

The changes to notifying_plugins_spec.rb should be reviewed somewhat carefully. Moving the HTTP server start causes a bunch of "Health monitor failed to connect to director" messages to be put in the Bosh::Monitor::Plugins::Dummy plugin event queue. Given that we can scan through all of the messages in the queue and eventually find the one we expect, it seems clear that the test as written assumed there would only ever be a single message in the event queue.
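The shape of the spec fix is roughly the following: instead of asserting that the expected event is the only one queued, scan the whole queue for it, skipping noise such as the "failed to connect to director" alerts. The helper name and event shape below are illustrative, not the spec's actual code.

```ruby
# Hypothetical helper mirroring the spec fix: find the expected event in a
# queue that may now also contain connection-failure noise.
def find_expected_event(queue, expected_kind)
  queue.find { |event| event[:kind] == expected_kind }
end
```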

I did not bother to find out why the "when health monitor fails to fetch deployments" test succeeds. Perhaps it succeeds by coincidence, given the nature of the new failure messages?

I was unable to discover where the Bosh Director (or stub of the same) that this thing contacts was being brought up.

What is this change about?

See above.

What tests have you run against this PR?

I have run most of the unit tests.

How should this change be described in bosh release notes?

Improves Health Monitor startup reliability by bringing up the Health Monitor HTTP server as fast as possible.

Does this PR introduce a breaking change?

I do not believe so. The user-visible behavior change is as follows:

Prior to this change, attempts to access the Health Monitor's /unresponsive_agents endpoint would fail with "connection refused" until the first survey of the healthiness of all deployments had completed.
After this change, attempts to access the Health Monitor's /unresponsive_agents endpoint will return 503 until the first survey of the healthiness of all deployments has completed.

Both are unsuccessful returns, and both indicate server failure, rather than something that could be corrected client-side, so I believe that this is not a breaking change.
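From a client's point of view, both failure modes collapse to "not ready yet", which is why the change is not breaking. A polling client might treat them identically, as in this sketch (the URL and helper name are illustrative):

```ruby
require 'net/http'

# Returns true only once /unresponsive_agents answers with a success code.
# A 503 (post-change) and a refused connection (pre-change) both read as
# "not ready yet" to the caller.
def unresponsive_agents_ready?(uri)
  res = Net::HTTP.get_response(uri)
  res.is_a?(Net::HTTPSuccess)
rescue Errno::ECONNREFUSED
  false
end
```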

Tag your pair, your PM, and/or team!

@aramprice

The commit message repeats the PR summary and "Why do this?" sections above, ending with:

[#187938284] [JIRA] BOSH-296 "Health monitor continuously exiting for foundation with large number of VMs"

Signed-off-by: Aram Price <aram.price@broadcom.com>
@klakin-pivotal klakin-pivotal requested review from a team, aramprice and nouseforaname and removed request for a team July 17, 2024 22:56
beyhan (Member) commented Jul 18, 2024

Issue #2524 could be related.

@klakin-pivotal klakin-pivotal merged commit 9c2fe7d into main Jul 19, 2024
4 checks passed
@klakin-pivotal klakin-pivotal removed the request for review from nouseforaname July 19, 2024 16:27
klakin-pivotal (Contributor, PR author)

Yeah, any situation where the Health Monitor is continually getting restarted every ten seconds, and not reporting some sort of failure in its logs is likely related to this.

@aramprice aramprice deleted the health-monitor-faster-bringup branch July 22, 2024 16:30
@klakin-pivotal klakin-pivotal mentioned this pull request Jul 22, 2024
klakin-pivotal added a commit that referenced this pull request Jul 22, 2024