Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ResourceUtilizationHealthCheck only reports CPU _or_ memory issues, not both #4677

Closed
Tratcher opened this issue Nov 7, 2023 · 14 comments · Fixed by #5407
Closed

ResourceUtilizationHealthCheck only reports CPU _or_ memory issues, not both #4677

Tratcher opened this issue Nov 7, 2023 · 14 comments · Fixed by #5407
Labels
area-fundamentals bug This issue describes a behavior which is not expected - a bug. help wanted Up for grabs. We would accept a PR to help resolve this issue work in progress 🚧

Comments

@Tratcher
Copy link
Member

Tratcher commented Nov 7, 2023

Description

When checking health based on resource utilization, the conditions are checked in series and only the first failure is reported.

public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
{
var utilization = _dataTracker.GetUtilization(_options.SamplingWindow);
if (utilization.CpuUsedPercentage > _options.CpuThresholds.UnhealthyUtilizationPercentage)
{
return Task.FromResult(HealthCheckResult.Unhealthy("CPU usage is above the limit"));
}
if (utilization.MemoryUsedPercentage > _options.MemoryThresholds.UnhealthyUtilizationPercentage)
{
return Task.FromResult(HealthCheckResult.Unhealthy("Memory usage is above the limit"));
}
if (utilization.CpuUsedPercentage > _options.CpuThresholds.DegradedUtilizationPercentage)
{
return Task.FromResult(HealthCheckResult.Degraded("CPU usage is close to the limit"));
}
if (utilization.MemoryUsedPercentage > _options.MemoryThresholds.DegradedUtilizationPercentage)
{
return Task.FromResult(HealthCheckResult.Degraded("Memory usage is close to the limit"));
}

Reproduction Steps

Test on a system with both high CPU and memory.

Expected behavior

Both CPU and memory issues should be reported.

Actual behavior

Only the first issue found is reported.

Regression?

No response

Known Workarounds

No response

Configuration

8.0

Other information

Recommendation: ResourceUtilizationHealthCheck should be split into two IHealthCheck implementations, one for CPU and one for memory. That way both states can be reported independently.

@Tratcher Tratcher changed the title ResourceUtilizationHealthCheck only reports CPU _or_ memory, not both ResourceUtilizationHealthCheck only reports CPU _or_ memory issues, not both Nov 7, 2023
@geeknoid geeknoid added area-fundamentals bug This issue describes a behavior which is not expected - a bug. and removed untriaged labels Nov 22, 2023
@geeknoid geeknoid added this to the 8.1 milestone Nov 27, 2023
@joperezr joperezr modified the milestones: 8.1, 8.2 Jan 22, 2024
@joperezr joperezr modified the milestones: 8.2, 8.6 May 2, 2024
@xakep139 xakep139 removed this from the 8.6 milestone Jun 3, 2024
@willibrandon
Copy link
Contributor

I just bumped into this myself and agree with @Tratcher’s recommendation.

Recommendation: ResourceUtilizationHealthCheck should be split into two IHealthCheck implementations, one for CPU and one for memory. That way both states can be reported independently.

Anything I can do to help?

@RussKie
Copy link
Member

RussKie commented Jul 25, 2024

Thank you for bumping this issue, @willibrandon. To me it totally makes sense.

@geeknoid @rsreepathi @mwierzchowski @mobratil what do you think of this?

@RussKie
Copy link
Member

RussKie commented Jul 26, 2024

To start the ball rolling, we'll need an API proposal which we'll then discuss and take to the ASP.NET API Review Board for further approval.

@RussKie
Copy link
Member

RussKie commented Jul 29, 2024

/cc: @evgenyfedorov2

@evgenyfedorov2
Copy link
Contributor

Hi @willibrandon , thanks for stepping in.

It does not look like we have to make any changes to public API surface, we probably can just create those two proposed IHealthCheck implementations and register both via the existing extension methods.

What do you think?

@evgenyfedorov2 evgenyfedorov2 added the help wanted Up for grabs. We would accept a PR to help resolve this issue label Jul 29, 2024
@willibrandon
Copy link
Contributor

willibrandon commented Aug 27, 2024

@evgenyfedorov2 - I apologize I am usually more prompt than this.

Yes I agree that we can just create those two proposed IHealthCheck implementations, but I also think that the health of each component should be included in the HealthCheckResult using the optional IReadOnlyDictionary<string, object> data parameter.

@willibrandon
Copy link
Contributor

Or perhaps it would be more appropriate to return a HealthReport?

Represents the result of executing a group of IHealthCheck instances.

@RussKie
Copy link
Member

RussKie commented Aug 28, 2024

Or perhaps it would be more appropriate to return a HealthReport?

Represents the result of executing a group of IHealthCheck instances.

From my observations, dotnet/aspnetcore has two implementations:

Task<HealthCheckResult> CheckHealthAsync(...)
Task<HealthReport> CheckHealthAsync(...)

The first one is used whenever there is a single component to be checked for health (this is the part of the IHealthCheck contract). The last one is used whenever there are multiple components involved (this is part of HealthCheckService contract).

In our case, we use the first signature, however, we check multiple components (i.e., CPU and memory), and this doesn't feel correct. I believe, we should have defined multiple health checks - one for the CPU utilisation, one for memory consumption, etc. Then those health checks should get registered via the standard mechanisms and aggregated via HealthCheckService implementations.

@sebastienros
Copy link
Member

IMO the component should keep doing what it is meant for (resource utilization health checks) and the simplest fix would be to just send a compound message per health level:

  • if CPU and memory are unhealthy send a message like "CPU and Memory usages are above their limits"
  • if only CPU or memory is unhealthy keep the current message (independently of the other being in degraded status)
  • if CPU and memory are degraded send a message like "CPU and Memory usages are close to their limits"
  • if only CPU or memory is degraded keep the current message

Creating two separate healthchecks is a different issue, though could make sense since healthchecks reports is a thing that would solve the aggregation issue (a breaking change if this one is removed instead). Also someone can already track CPU only or memory only with this component (by setting custom limits for each).

@RussKie
Copy link
Member

RussKie commented Aug 29, 2024

IMO the component should keep doing what it is meant for (resource utilization health checks) and the simplest fix would be to just send a compound message per health level:

  • if CPU and memory are unhealthy send a message like "CPU and Memory usages are above their limits"
  • if only CPU or memory is unhealthy keep the current message (independently of the other being in degraded status)
  • if CPU and memory are degraded send a message like "CPU and Memory usages are close to their limits"
  • if only CPU or memory is degraded keep the current message

Creating two separate healthchecks is a different issue, though could make sense since healthchecks reports is a thing that would solve the aggregation issue (a breaking change if this one is removed instead). Also someone can already track CPU only or memory only with this component (by setting custom limits for each).

@sebastienros, to clarify I understand your suggestion correctly, do you mean the following?

message = "";
if (cpu != healthy) { message += "CPU unhealthy" }
if (memory != healthy) { message += "memory unhealthy" }
return HealthCheckResult.Unhealthy(message);

@sebastienros
Copy link
Member

@RussKie something like this, yes.

An alternative could be to have the degraded message also added, but still return the highest of the levels. Example with CPU unhealthy and memory degraded:

CPU usage is above the limit. Memory is close to the limit, and return Unhealthy because one them at least is unhealthy. If both were degraded then Degraded would be returned.

That would also make the code easier. Just keep the current checks, but don't return, just append to the message. And in the end return the highest detected level.

@RussKie
Copy link
Member

RussKie commented Aug 29, 2024

As a short-term solution/workaround this is probably ok, but to me this still feels as conflating two (or more) checks into one, which feels like conflicting with the purpose of the IHealthCheck contract.

@sebastienros
Copy link
Member

@RussKie definitely.

I assume that the goal here was performance. Two healthchecks would mean two calls to _dataTracker.GetUtilization(_options.SamplingWindow); which might be expensive.

@willibrandon
Copy link
Contributor

A compound message in the description works for me and even better if additional key-value pairs describing the health of each component could be included in the data.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-fundamentals bug This issue describes a behavior which is not expected - a bug. help wanted Up for grabs. We would accept a PR to help resolve this issue work in progress 🚧
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants