The work we did recently to integrate host health monitoring into the runtime (62aa143) has resulted in some production issues for customers. The symptoms are repeated "Host thresholds exceeded: [Connections]" errors in host logs, without the host ever stabilizing. The changes were intended to proactively identify when the host is nearing environmental sandbox thresholds (e.g. 300 connections) and temporarily spin down the host so that it stops processing new work until it stabilizes and becomes healthy again. In Consumption plans, other instances will continue processing while the over-threshold host recovers.
In the previous feature work, these connection errors were expected when the host was nearing thresholds. However, the intent was that the health check would identify and log the issue, then resolve it via a host restart, which was expected to free up resources. It appears that in many cases this isn't enough, e.g. when connections are leaked statically in the app domain and aren't freed on the restart. In many cases this will be due to poorly written functions that don't manage connections well.
The plan is to improve the periodic background health check by adding a simple sliding window + threshold that will recycle the app domain (i.e. HostingEnvironment.Shutdown()) if the number of unhealthy checks within the sliding time window exceeds the threshold. This addresses the cases where apps are getting “stuck” in a host restart loop, unable to recover. The time window spans multiple restart attempts, so it can catch situations where the host is restarting continuously without recovering due to connection threshold oscillations. The current host restart appears not to be enough to clean up connection leaks/overages in all cases.
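To make the sliding window + threshold idea concrete, here is a minimal sketch in Python. It is not the runtime's implementation: the class name `HealthWindow`, its parameters, and the choice to pass the clock value in as `now` are all assumptions made for illustration. It reads "exceeding the threshold" as strictly greater than.

```python
from collections import deque


class HealthWindow:
    """Hypothetical helper: counts unhealthy checks inside a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        # Timestamps (in seconds) of unhealthy checks, oldest first.
        self._unhealthy = deque()

    def record(self, now, healthy):
        """Record one health check result taken at time `now`.

        Returns True when the unhealthy count inside the window exceeds
        the threshold, i.e. when the caller should recycle the app domain.
        """
        if not healthy:
            self._unhealthy.append(now)
        # Drop unhealthy results that have slid out of the window.
        cutoff = now - self.window_seconds
        while self._unhealthy and self._unhealthy[0] <= cutoff:
            self._unhealthy.popleft()
        return len(self._unhealthy) > self.threshold


if __name__ == "__main__":
    # 15 minute window, threshold of 30, one unhealthy check every 15 seconds.
    window = HealthWindow(window_seconds=900, threshold=30)
    for i in range(40):
        if window.record(now=i * 15, healthy=False):
            print(f"recycle after check #{i + 1}")
            break
```

Because the timestamps live outside any single host instance's lifetime in this sketch, the count survives individual host restarts, which is the property that lets the window span multiple restart attempts.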
We'll make the window/threshold and health check intervals configurable (with good defaults). E.g. check health every 15 seconds, with a 15 minute sliding window and an unhealthy count threshold of 30. That window yields a total of 60 health checks, so the threshold of 30 would represent 50%. In one case I was looking at, the customer’s app had 47 unhealthy host checks within a 15 minute window.
This sounds good, though I think we might need to tweak the numbers. An unhealthy count of 30 with a check frequency of 15 seconds means it would take at least 7 minutes from the first unhealthy result before the app domain is recycled. This is much too long. I would like to get that 7 minutes down to 1 or 2.
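For reference, the arithmetic behind these numbers can be sketched as below. The function names are hypothetical, and the sketch assumes the best case where every check from the first unhealthy result onward is unhealthy and the recycle fires on the check that reaches the threshold.

```python
def checks_per_window(window_seconds, interval_seconds):
    """Number of health checks that fit in one sliding window."""
    return window_seconds // interval_seconds


def min_seconds_to_recycle(threshold, interval_seconds):
    """Shortest time from the first unhealthy result to the threshold-th one.

    Assumes back-to-back unhealthy checks: the first lands at t=0, so the
    threshold-th lands (threshold - 1) intervals later.
    """
    return (threshold - 1) * interval_seconds


def threshold_for_target(target_seconds, interval_seconds):
    """Largest threshold whose best-case delay stays within target_seconds."""
    return target_seconds // interval_seconds + 1


if __name__ == "__main__":
    interval, window, threshold = 15, 15 * 60, 30
    total = checks_per_window(window, interval)
    print(total, threshold / total)                           # checks per window, threshold fraction
    print(min_seconds_to_recycle(threshold, interval) / 60)   # best-case minutes to recycle
    for target_minutes in (1, 2):
        print(target_minutes, threshold_for_target(target_minutes * 60, interval))
```

Under these assumptions, the proposed defaults give a best-case delay of a little over 7 minutes, and with a 15 second interval a 1 to 2 minute target corresponds to a threshold somewhere between 5 and 9. Shortening the check interval would be the other lever.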