Akka.Persistence HealthCheck API #7840

@Aaronontheweb

Is your feature request related to a problem? Please describe.

Since we shipped https://github.com/akkadotnet/Akka.Hosting/releases/tag/1.5.47-beta1, we've essentially moved Akka.NET's health check implementation from https://github.com/petabridge/akkadotnet-healthcheck to Akka.Hosting - a net positive change that greatly reduces the amount of configuration overhead and installed NuGet packages needed to power Akka.NET's health check system.

However, we are missing one beloved feature from Akka.HealthChecks - the Akka.Persistence checks, which were implemented using the infamous SuicideProbe:

https://github.com/petabridge/akkadotnet-healthcheck/blob/d482d5399be4b9ccd1d39f759cfb6be21384f8bf/src/Akka.HealthCheck.Persistence/AkkaPersistenceLivenessProbe.cs#L248-L295

We need to bring some semblance of this functionality back into the picture in Akka.Hosting's health check implementation.

Describe the solution you'd like

The SuicideProbe, despite its amusing and delightful name, was a bit problematic:

  1. It accidentally polluted journals / snapshot stores - this took several rounds of bug-fixing to get right;
  2. It resulted in billable write units for customers running on cloud providers - health checks should run as close to zero cost as possible;
  3. It was a rather complex and somewhat fragile piece of infrastructure; and
  4. Like most of the health checks in Akka.HealthCheck, it was too aggressive - a single persist / recover failure could trigger a liveness check failure. We tried adjusting this via how its parent, the AkkaPersistenceLivenessProbe, handled retries, but that proved to be a bit unwieldy too.

So what I'm proposing is we add a new virtual method to the AsyncWriteJournal and SnapshotStore base classes:

public enum HealthCheckResult
{
    Healthy = 0,
    Degraded = 1,  // transient failures
    Unhealthy = 2  // irrecoverable failures
}

public virtual Task<HealthCheckResult> CheckHealthAsync(CancellationToken ct = default);

I think the default base class implementation could just use the CircuitBreaker's Open/Closed status, since that would be a reliable method for determining whether or not the plugin was struggling to perform its work over a recent period of time - AND we could reset the healthcheck status from Degraded --> Healthy when the CircuitBreaker resets.
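A minimal sketch of what that default could look like, under stated assumptions: `BreakerState` and `PersistencePluginBase` are hypothetical stand-ins (not Akka.NET types) so the example compiles on its own, and mapping both Open and HalfOpen to Degraded is just one possible reading of the proposal.

```csharp
using System.Threading;
using System.Threading.Tasks;

public enum HealthCheckResult
{
    Healthy = 0,
    Degraded = 1,  // transient failures
    Unhealthy = 2  // irrecoverable failures
}

// Hypothetical stand-in for the state of the plugin's circuit breaker.
public enum BreakerState { Closed, HalfOpen, Open }

// Hypothetical stand-in for the AsyncWriteJournal / SnapshotStore base class.
public class PersistencePluginBase
{
    // Assumption: the base class can observe its breaker's current state.
    public BreakerState Breaker { get; set; } = BreakerState.Closed;

    public virtual Task<HealthCheckResult> CheckHealthAsync(CancellationToken ct = default)
    {
        if (ct.IsCancellationRequested)
            return Task.FromCanceled<HealthCheckResult>(ct);

        // Closed means recent calls succeeded; anything else means the plugin
        // has been struggling. Unhealthy is left for plugin-specific overrides.
        var result = Breaker == BreakerState.Closed
            ? HealthCheckResult.Healthy
            : HealthCheckResult.Degraded;

        return Task.FromResult(result);
    }
}
```

Because the result is derived from the breaker's state on every call, the status returns to Healthy on its own once the breaker closes again, with no extra reads or writes against the store.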

Describe alternatives you've considered

In some specific Akka.Persistence plugins, such as Akka.Persistence.Sql, you could implement something akin to the EF Core health checks, which try to open a connection using the provided connection string:

https://github.com/dotnet/aspnetcore/blob/997928da18836abedca802284a907aa42017e87c/src/Middleware/HealthChecks.EntityFrameworkCore/src/DbContextHealthCheck.cs#L11-L14
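As a rough illustration of that alternative, a plugin-specific check might look something like the sketch below. `ConnectionProbe` and the `tryOpen` delegate are hypothetical names, not part of Akka.Persistence.Sql or EF Core; the delegate is assumed to open and dispose a single connection. Reporting a failed attempt as Degraded rather than Unhealthy is also an assumption, chosen because a single failure may be transient.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum HealthCheckResult { Healthy = 0, Degraded = 1, Unhealthy = 2 }

public static class ConnectionProbe
{
    // tryOpen is a hypothetical delegate supplied by the plugin, for example:
    //   async ct => { await using var c = createConnection(); await c.OpenAsync(ct); }
    public static async Task<HealthCheckResult> CheckAsync(
        Func<CancellationToken, Task> tryOpen,
        CancellationToken ct = default)
    {
        try
        {
            await tryOpen(ct).ConfigureAwait(false);
            return HealthCheckResult.Healthy;
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            // The caller cancelled the check; propagate instead of guessing a status.
            throw;
        }
        catch (Exception)
        {
            // Assumption: one failed connection attempt is treated as transient.
            return HealthCheckResult.Degraded;
        }
    }
}
```

Unlike the breaker-based default, this does cost one connection attempt per check, but it performs no writes against the journal or snapshot store.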

Additional context

Having the base class implementation be virtual, rather than abstract, ensures that this won't be a breaking change that requires all plugins to be recompiled AND ensures that we can do something useful with the private CircuitBreaker fields.
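To make the compatibility argument concrete, here is a small sketch in which `PluginBase`, `LegacyPlugin` and `CustomPlugin` are all hypothetical stand-ins. The placeholder default simply returns Healthy; the real default would presumably consult the circuit breaker.

```csharp
using System.Threading;
using System.Threading.Tasks;

public enum HealthCheckResult { Healthy = 0, Degraded = 1, Unhealthy = 2 }

// Hypothetical stand-in for AsyncWriteJournal / SnapshotStore.
public abstract class PluginBase
{
    // Virtual, not abstract: existing subclasses need no changes.
    public virtual Task<HealthCheckResult> CheckHealthAsync(CancellationToken ct = default)
        => Task.FromResult(HealthCheckResult.Healthy); // placeholder default
}

// A plugin written before the method existed: no override, inherits the default.
public sealed class LegacyPlugin : PluginBase { }

// A newer plugin that opts in with its own check.
public sealed class CustomPlugin : PluginBase
{
    public override Task<HealthCheckResult> CheckHealthAsync(CancellationToken ct = default)
        => Task.FromResult(HealthCheckResult.Unhealthy); // stand-in for a real probe
}
```

Had the method been abstract, `LegacyPlugin` would fail to compile until it supplied an implementation.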
