Health Service

The HealthService is an application component that runs alongside other components, such as CatalogService and BackgroundProcessor, on the compute cluster. It provides a REST API that Azure Front Door calls to determine the health of a stamp (region). Unlike the basic liveness probes that are present on every API, the HealthService is a more complex component that reflects the state of its dependencies in addition to its own.

HealthService conceptual diagram

If the cluster itself is down, the health service won't respond at all. When the service is up and running, it performs periodic checks against various components of the solution:

  • It attempts to do a simple query against Cosmos DB (read and write)
  • It attempts to send a message to Event Hub (the message will be filtered out by the background worker)
  • It looks up a state file on the storage account. This file can be used to turn off a region, even while the other checks are still passing.

In addition to the direct "pings" to the downstream services, the HealthService also queries Azure Monitor for the current HealthScore (as defined in the Health Model). Although Azure Monitor has an ingestion latency (typically a few minutes), the HealthScore also takes into account additional signals that are not covered by the pings of the HealthService.

Hence, the combination of both gives the HealthService a near real-time picture of the most critical dependencies of the API, as well as a holistic view of the entire stamp as defined in the health model.

All health checks are performed by a background worker (HealthJob.cs) and their results are cached in memory for a configurable number of seconds (10 by default), so that not every call to the API results in backend calls. While this adds a small potential latency in detecting outages, it also reduces the additional cluster load generated by health checks.

Configuration

Refer to CatalogService configuration for details of the implementation.

Apart from the configuration settings which are common between components, such as Cosmos DB connection settings, the following settings are used exclusively by the HealthService:

  • HealthServiceCacheDurationSeconds: Controls the expiration time of memory cache, in seconds.
  • HealthServiceStorageConnectionString: Connection string for the Storage Account where the status file should be present.
  • HealthServiceBlobContainerName: Storage Container where the status file should be present.
  • HealthServiceBlobName: Name of the status file - health check will look for this.
  • HealthServiceOverallTimeoutSeconds: Timeout for the whole check - defaults to 3 seconds. If the check doesn't finish in this interval, the service reports unhealthy.
  • HealthServiceAzMonitorHealthStatusQuery: Kusto query which is used to retrieve the Health Status. See below for default.

Individual health checks can be disabled by adding an application setting like:

HEALTHSERVICE_CHECK_<check name>_DISABLED = "true"

Currently the available health checks are:

  • AzMonitorHealthScore
  • BlobStorage
  • Database
  • MessageProducer

To disable, for instance, the Blob Storage health check, add an application setting to the HealthService:

HEALTHSERVICE_CHECK_BLOBSTORAGE_DISABLED = "true"
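
The naming convention above can be sketched in a few lines of C#. This is a minimal illustration, not the HealthService's actual code: the class HealthCheckToggle and both of its methods are hypothetical, and upper-casing the check name is an assumption inferred from the BLOBSTORAGE example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class HealthCheckToggle
{
    // Builds the setting name for a check, e.g. "BlobStorage" becomes
    // "HEALTHSERVICE_CHECK_BLOBSTORAGE_DISABLED" (upper-casing is an assumption).
    public static string SettingName(string checkName) =>
        $"HEALTHSERVICE_CHECK_{checkName.ToUpperInvariant()}_DISABLED";

    // A check counts as disabled only if the setting exists and parses as "true".
    public static bool IsDisabled(string checkName, IReadOnlyDictionary<string, string> settings) =>
        settings.TryGetValue(SettingName(checkName), out var value)
        && bool.TryParse(value, out var disabled)
        && disabled;
}
```

With a settings dictionary that contains only HEALTHSERVICE_CHECK_BLOBSTORAGE_DISABLED = "true", IsDisabled("BlobStorage", settings) returns true, while IsDisabled("Database", settings) returns false.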

Implementation

All checks are done asynchronously and in parallel. If any of them fails, the whole stamp will be considered unavailable.
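
The following sketch shows one way to run several checks in parallel under an overall timeout (compare HealthServiceOverallTimeoutSeconds). It is an illustration under stated assumptions rather than the actual implementation: the class ParallelHealthChecks, its method names, and the representation of a check as a Func<CancellationToken, Task<bool>> are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class ParallelHealthChecks
{
    // Returns true only if every check reports healthy before the
    // overall timeout elapses.
    public static async Task<bool> RunAllAsync(
        IReadOnlyDictionary<string, Func<CancellationToken, Task<bool>>> checks,
        TimeSpan overallTimeout)
    {
        using var cts = new CancellationTokenSource(overallTimeout);

        // Start every check before awaiting any, so that they run in parallel.
        var running = checks.Values.Select(check => RunSafelyAsync(check, cts.Token)).ToList();
        Task<bool[]> all = Task.WhenAll(running);

        // Whichever completes first decides: the checks or the timeout.
        Task timeout = Task.Delay(overallTimeout);
        if (await Task.WhenAny(all, timeout) != all)
        {
            cts.Cancel();
            return false; // Not finished in time, so report unhealthy.
        }

        bool[] results = await all;
        return results.All(healthy => healthy);
    }

    // An exception in a single check counts as "unhealthy".
    private static async Task<bool> RunSafelyAsync(
        Func<CancellationToken, Task<bool>> check, CancellationToken token)
    {
        try { return await check(token); }
        catch (Exception) { return false; }
    }
}
```

Wrapping each check so that exceptions become a false result keeps one faulty dependency from throwing out of Task.WhenAll, while still marking the stamp as unavailable.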

Check results are cached in memory. Cache expiration is controlled by SysConfig.HealthServiceCacheDurationSeconds and is set to 10 seconds by default.

This reduces the additional load generated by health checks, as not every request results in a downstream call to the dependent services.
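
To illustrate the effect of the cache, here is a simplified sketch of a time-limited in-memory result. Note that it refreshes lazily on request, which is a deliberately simpler variant than the background worker the HealthService uses. The class CachedHealthResult, its members, and the injectable clock are hypothetical names introduced for this example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical lazy-refresh cache, for illustration only.
public sealed class CachedHealthResult
{
    private readonly Func<Task<bool>> _runChecks;
    private readonly TimeSpan _cacheDuration;
    private readonly Func<DateTimeOffset> _now; // injectable clock, eases testing
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private bool _cached;
    private DateTimeOffset _expiresAt = DateTimeOffset.MinValue;

    public CachedHealthResult(
        Func<Task<bool>> runChecks, TimeSpan cacheDuration, Func<DateTimeOffset> now)
    {
        _runChecks = runChecks;
        _cacheDuration = cacheDuration;
        _now = now;
    }

    // Returns the cached result, re-running the checks only after expiry.
    public async Task<bool> GetAsync()
    {
        await _gate.WaitAsync(); // one refresh at a time
        try
        {
            if (_now() >= _expiresAt)
            {
                _cached = await _runChecks();
                _expiresAt = _now() + _cacheDuration;
            }
            return _cached;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

With a cache duration of 10 seconds, any number of requests within that window triggers the downstream checks at most once.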

Blob check

The blob check currently serves two purposes:

  1. Test if it's possible to reach Blob Storage. This storage account is also used by other components in the stamp and hence considered a critical resource.
  2. Manually "turn off" a region by manipulating (i.e. deleting) the state file.

We decided that this check should only look for the presence of a state file in the specified Blob Container, but not process its content in any way. There is also the possibility to set up a more sophisticated system which would read the content of the file and return different status based on that (such as "HEALTHY", "UNHEALTHY", "MAINTENANCE" etc.).

Remove the state file to disable a stamp.

Make sure the file is present after deploying the application. Otherwise the health service will always respond with UNHEALTHY and Front Door will not recognize the backend as available. Terraform creates this file, so it should be present after the infrastructure deployment.

Event Hub check

Event Hub health reporting is handled by the EventHubProducerService. This service reports healthy if it's able to send a new message to Event Hub. For filtering, this message has an identifying property added to it:

HEALTHCHECK=TRUE

This message is ignored on the receiving end (AlwaysOn.BackgroundProcessor.EventHubProcessorService.ProcessEventHanderAsync()), which checks for the HEALTHCHECK property.
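
The filtering on the receiving end can be sketched as a simple property lookup. This is an illustration only: the class HealthCheckMessageFilter and its method are hypothetical, and the IDictionary<string, object> parameter is assumed to stand in for the application properties of the received event.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical filter, for illustration only.
public static class HealthCheckMessageFilter
{
    public const string PropertyName = "HEALTHCHECK";

    // Returns true when the message carries the HEALTHCHECK marker property,
    // in which case the receiver should skip normal processing.
    public static bool IsHealthCheck(IDictionary<string, object> properties) =>
        properties != null && properties.ContainsKey(PropertyName);
}
```

A message with the property HEALTHCHECK=TRUE is recognized and skipped, while a regular message without that property is processed as usual.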

Cosmos DB check

Cosmos DB health reporting is handled by the CosmosDbService, which reports healthy if it is:

  • Able to connect to Cosmos DB database and perform a simple query.
  • Able to write a test document to the database (the test document has a very short Time-to-Live set, so Cosmos DB automatically removes it).

The HealthService performs two separate probes because Cosmos DB could be in a state in which reads still work but writes do not.

The read-only probe uses the following query, which doesn't fetch any data and has little impact on overall load:

SELECT GetCurrentDateTime()

The write query creates a dummy ItemRating with minimum content:

var testRating = new ItemRating()
{
    Id = Guid.NewGuid(),
    CatalogItemId = Guid.NewGuid(), // Create some random (=non-existing) item id
    CreationDate = DateTime.UtcNow,
    Rating = 1,
    TimeToLive = 10 // will be auto-deleted after 10sec
};

await AddNewRatingAsync(testRating);

Azure Monitor Health Status Query

The regional Azure Monitor Log Analytics workspace is queried for the latest Health Status. If that is equal to or below a certain threshold, the stamp is considered unhealthy. The query can be configured (HEALTHSERVICE_AZMONITOR_HEALTHSTATUS_QUERY). Currently it uses the following KQL query:

StampHealthScore 
| order by TimeGenerated desc 
| take 1 
| project TimeGenerated, Healthy=tobool(1-RedScore)
