
HostHealthMonitor Improvements #2115

Closed
mathewc opened this issue Nov 10, 2017 · 1 comment
mathewc commented Nov 10, 2017

The work we did recently to integrate host health monitoring into the runtime (62aa143) has resulted in some production issues for customers. The symptoms are repeated `Host thresholds exceeded: [Connections]` errors in host logs, without the host ever stabilizing. The changes were intended to proactively identify when the host is nearing environmental sandbox thresholds (e.g. 300 connections) and temporarily spin down the host so it stops processing new work for a period of time until the host stabilizes and becomes healthy again. In Consumption plans, other instances will continue processing while the host over threshold recovers.

In the previous feature work, these connection errors were expected when the host was nearing thresholds; however, the intent was that the health check would identify and log the issue and resolve it via a host restart, which was expected to free up resources. It appears that in many cases this isn't enough, e.g. connections that are leaked statically in the app domain and aren't freed on the restart. Often this will be due to poorly written functions that don't manage connections well.

The plan is to improve the periodic background health check by adding a simple sliding window + threshold that will recycle the app domain (i.e. `HostingEnvironment.Shutdown()`) if the host has been unhealthy a number of times exceeding the threshold within the sliding time window. This addresses the cases where apps are getting “stuck” in a host restart loop, unable to recover. The time window spans multiple restart attempts, so it can catch situations where the host is restarting continuously without recovering due to connection threshold oscillations. The current host restart appears not to be enough to clean up connection leaks/overages in all cases.

We'll make the window/threshold and health check intervals configurable (with good defaults). E.g. check health every 15 seconds, with a 15-minute sliding window and a 30 unhealthy count threshold. That window yields a total of 60 health checks, so the threshold of 30 would represent 50%. In one case I was looking at, the customer’s app had 47 unhealthy host checks within a 15-minute window.
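The sliding window + threshold logic described above can be sketched roughly as follows. This is an illustrative Python sketch only (the actual host is C#, and the class/method names here are hypothetical, not from the runtime): keep timestamps of unhealthy results, age out those older than the window, and signal a recycle when the count crosses the threshold.

```python
import time
from collections import deque


class SlidingWindowHealthMonitor:
    """Sketch of a sliding-window unhealthy-count check.

    If the number of unhealthy health-check results within the last
    `window_seconds` reaches `unhealthy_threshold`, signal that the
    app domain should be recycled.
    """

    def __init__(self, window_seconds=15 * 60, unhealthy_threshold=30,
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.unhealthy_threshold = unhealthy_threshold
        self.clock = clock  # injectable for testing
        self._unhealthy_times = deque()  # timestamps of unhealthy results

    def record_check(self, healthy):
        """Record one periodic health check result.

        Returns True when the host should be recycled (the equivalent
        of calling HostingEnvironment.Shutdown() in the real host).
        """
        now = self.clock()
        # Drop unhealthy results that have aged out of the sliding window.
        while (self._unhealthy_times and
               now - self._unhealthy_times[0] > self.window_seconds):
            self._unhealthy_times.popleft()
        if not healthy:
            self._unhealthy_times.append(now)
        return len(self._unhealthy_times) >= self.unhealthy_threshold
```

With the example defaults (15 s checks, 15-minute window, threshold 30), a host that is unhealthy on every check would trip the recycle on the 30th consecutive unhealthy result; isolated unhealthy blips age out of the window and never accumulate to the threshold.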

@paulbatum

This sounds good, though I think we might need to tweak the numbers. An unhealthy count of 30 with a check frequency of 15 seconds means it would take at least 7 minutes from the first unhealthy result to lead to an app domain recycle. This is much too long. I would like to get that 7 minutes down to 1 or 2.
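The tuning tradeoff in the comment above is simple arithmetic: with consecutive unhealthy results, the Nth one arrives (N − 1) check intervals after the first, so the fastest possible path to a recycle is (threshold − 1) × interval. A small sketch (the function name and the tightened threshold of 8 are illustrative assumptions, not values from this issue):

```python
def min_time_to_recycle(check_interval_s, unhealthy_threshold):
    """Minimum seconds from the first unhealthy check to a recycle,
    assuming every check in between is also unhealthy (fastest path)."""
    # The threshold-th unhealthy result arrives (threshold - 1)
    # intervals after the first one.
    return (unhealthy_threshold - 1) * check_interval_s


# Proposed defaults: 15 s checks, threshold 30 -> 435 s (~7.25 minutes).
print(min_time_to_recycle(15, 30) / 60)
# A tightened threshold, e.g. 8, gets under the ~2 minute target:
# 105 s (~1.75 minutes).
print(min_time_to_recycle(15, 8) / 60)
```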

@paulbatum paulbatum added this to the Triaged milestone Nov 16, 2017
@paulbatum paulbatum modified the milestones: Triaged, Sprint 12 Nov 29, 2017
@mathewc mathewc closed this as completed Dec 5, 2017
@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2020