
HostHealthMonitor Improvements #2115

Closed
mathewc opened this issue Nov 10, 2017 · 1 comment
mathewc commented Nov 10, 2017

The work we did recently to integrate host health monitoring into the runtime (62aa143) has resulted in some production issues for customers. The symptoms are repeated `Host thresholds exceeded: [Connections]` errors in host logs, without the host ever stabilizing. The changes were intended to proactively identify when the host is nearing environmental sandbox thresholds (e.g. 300 connections) and temporarily spin down the host so it stops processing new work for a period of time until the host stabilizes and becomes healthy again. In Consumption plans, other instances will continue processing while the host over threshold recovers.

In the previous feature work, these connection errors were expected when the host was nearing thresholds; however, the intent was that the health check would identify and log the issue and resolve it via a host restart, which was expected to free up resources. It appears that in many cases this isn't enough, e.g. connections that are leaked statically in the app domain and aren't freed on the restart. Often this will be due to poorly written functions that don't manage connections well.

The plan is to improve the periodic background health check by adding a simple sliding window + threshold that will recycle the app domain (i.e. `HostingEnvironment.Shutdown()`) if the host has been unhealthy a number of times exceeding the threshold within the sliding time window. This addresses the cases where apps are getting “stuck” in a host restart loop, unable to recover. The time window spans multiple restart attempts, so it can catch situations where the host is restarting continuously without recovering due to connection threshold oscillations. The current host restart appears not to be enough to clean up connection leaks/overages in all cases.

We'll make the window/threshold and health check intervals configurable (with good defaults). E.g. check health every 15 seconds, with a 15-minute sliding window and a 30 unhealthy count threshold. That window yields a total of 60 health checks, so the threshold of 30 would represent 50%. In one case I was looking at, the customer’s app had 47 unhealthy host checks within a 15-minute window.
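The sliding window + threshold logic described above can be sketched roughly as follows. This is an illustrative Python sketch only (the actual host is C#, and the class/method names here are hypothetical, not from the runtime): keep timestamps of unhealthy results, age out those older than the window, and signal a recycle when the count crosses the threshold.

```python
import time
from collections import deque


class SlidingWindowHealthMonitor:
    """Sketch of a sliding-window unhealthy-count check.

    If the number of unhealthy health-check results within the last
    `window_seconds` reaches `unhealthy_threshold`, signal that the
    app domain should be recycled.
    """

    def __init__(self, window_seconds=15 * 60, unhealthy_threshold=30,
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.unhealthy_threshold = unhealthy_threshold
        self.clock = clock  # injectable for testing
        self._unhealthy_times = deque()  # timestamps of unhealthy results

    def record_check(self, healthy):
        """Record one periodic health check result.

        Returns True when the host should be recycled (the equivalent
        of calling HostingEnvironment.Shutdown() in the real host).
        """
        now = self.clock()
        # Drop unhealthy results that have aged out of the sliding window.
        while (self._unhealthy_times and
               now - self._unhealthy_times[0] > self.window_seconds):
            self._unhealthy_times.popleft()
        if not healthy:
            self._unhealthy_times.append(now)
        return len(self._unhealthy_times) >= self.unhealthy_threshold
```

With the example defaults (15 s checks, 15-minute window, threshold 30), a host that is unhealthy on every check would trip the recycle on the 30th consecutive unhealthy result; isolated unhealthy blips age out of the window and never accumulate to the threshold.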

@paulbatum

This sounds good, though I think we might need to tweak the numbers. An unhealthy count of 30 with a check frequency of 15 seconds means it would take at least 7 minutes from the first unhealthy result to lead to an app domain recycle. This is much too long. I would like to get that 7 minutes down to 1 or 2.
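The tuning tradeoff in the comment above is simple arithmetic: with consecutive unhealthy results, the Nth one arrives (N − 1) check intervals after the first, so the fastest possible path to a recycle is (threshold − 1) × interval. A small sketch (the function name and the tightened threshold of 8 are illustrative assumptions, not values from this issue):

```python
def min_time_to_recycle(check_interval_s, unhealthy_threshold):
    """Minimum seconds from the first unhealthy check to a recycle,
    assuming every check in between is also unhealthy (fastest path)."""
    # The threshold-th unhealthy result arrives (threshold - 1)
    # intervals after the first one.
    return (unhealthy_threshold - 1) * check_interval_s


# Proposed defaults: 15 s checks, threshold 30 -> 435 s (~7.25 minutes).
print(min_time_to_recycle(15, 30) / 60)
# A tightened threshold, e.g. 8, gets under the ~2 minute target:
# 105 s (~1.75 minutes).
print(min_time_to_recycle(15, 8) / 60)
```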

@paulbatum paulbatum added this to the Triaged milestone Nov 16, 2017
@paulbatum paulbatum modified the milestones: Triaged, Sprint 12 Nov 29, 2017
@mathewc mathewc closed this as completed Dec 5, 2017
@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2020