
pageserver: circuit breakers #6738

Closed
jcsp opened this issue Feb 13, 2024 · 1 comment
Assignees: jcsp
Labels: a/tech_debt (Area: related to tech debt), c/storage/pageserver (Component: storage: pageserver)

Comments

jcsp (Collaborator) commented Feb 13, 2024

Examples of recent issues in other services that show up as storage issues:

  • Logical replication writes causing >1000x storage amplification, exhausting the pageserver's disk.
  • Too-fast retries on non-attached tenants saturating our ability to write error logs, making the pageserver unresponsive.
  • Corrupt pages causing compactions on damaged tenants to retry forever, emitting an endless stream of logs.

While we of course fix these issues at the source, we can still make the pageserver more resilient.

Implementation:

  • A CircuitBreaker type with success() and fail() functions, a threshold for how many consecutive failures break the circuit, and a duration after which the circuit closes again after breaking (may be none, to stay open forever); see the sketch after this list.
  • A tenant config property to override the circuit breaker, so that if repairing a seriously damaged tenant requires the pageserver to let it keep operating for some period, we can disable the breaker.
  • Circuit breakers in key locations:
    • Track success/fail of getpage requests for a timeline; on more than N consecutive errors, break the circuit so that subsequent requests on that timeline's page_service do no work at all (including no logging and no TenantManager locks).
    • Track success/fail of compaction; on more than N consecutive errors (e.g. 5), break the circuit to stop subsequent compaction attempts for a long period (e.g. 1 hr).
    • Periodically check for pathological storage amplification; a single failure is enough to break the circuit, and breaking it stops WAL ingest.
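
A minimal sketch of such a breaker, in Rust (names and details are illustrative, not the type that eventually landed in the pageserver):

```rust
use std::time::{Duration, Instant};

/// Illustrative sketch: `success()` resets the failure count, `fail()`
/// increments it, and once `failure_threshold` consecutive failures are seen
/// the circuit breaks. It closes again after `reset_period`, or stays broken
/// forever if that is `None`.
pub struct CircuitBreaker {
    failure_threshold: usize,
    reset_period: Option<Duration>,
    consecutive_failures: usize,
    broken_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: usize, reset_period: Option<Duration>) -> Self {
        Self {
            failure_threshold,
            reset_period,
            consecutive_failures: 0,
            broken_at: None,
        }
    }

    pub fn success(&mut self) {
        self.consecutive_failures = 0;
        self.broken_at = None;
    }

    pub fn fail(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            // Record when the circuit first broke; keep the original instant
            // if it is already broken.
            self.broken_at.get_or_insert_with(Instant::now);
        }
    }

    /// True if callers should skip work right now.
    pub fn is_broken(&self) -> bool {
        match (self.broken_at, self.reset_period) {
            (None, _) => false,
            (Some(_), None) => true, // breaker never closes again
            (Some(at), Some(period)) => at.elapsed() < period,
        }
    }
}
```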
jcsp added the c/storage/pageserver (Component: storage: pageserver) and a/tech_debt (Area: related to tech debt) labels Feb 13, 2024
jcsp (Collaborator, Author) commented Feb 13, 2024

Starting-point PR that needs generalizing somewhat: #6734

The main gate for landing something like this is having the tenant config mechanism to skip the checks, so that if we need to permit a misbehaving tenant to keep ingesting in order to get it into a fixed state, we can; a sketch of such an override is below.
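
Building on the CircuitBreaker sketch above, the override could be as simple as a per-tenant flag consulted alongside the breaker; the names here are hypothetical, not actual pageserver config fields:

```rust
/// Hypothetical per-tenant override: when set, circuit breakers are bypassed
/// for this tenant so that a damaged tenant can keep ingesting while we
/// repair it.
#[derive(Default)]
pub struct TenantConfOverride {
    pub disable_circuit_breakers: bool,
}

impl CircuitBreaker {
    /// Like `is_broken()`, but honours the per-tenant override.
    pub fn is_broken_for(&self, conf: &TenantConfOverride) -> bool {
        !conf.disable_circuit_breakers && self.is_broken()
    }
}
```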

jcsp self-assigned this Apr 5, 2024
jcsp added a commit that referenced this issue Jul 12, 2024
## Problem

We already back off on compaction retries, but the impact of a failing compaction can be so great that backing off for up to 300s isn't enough. A failing compaction can consume a lot of I/O and CPU when generating image layers for large tenants, and potentially also leak disk space.

Compaction failures are extremely rare and almost always indicate a bug, frequently a bug that will not let compaction proceed until it is fixed.

Related: #6738

## Summary of changes

- Introduce a CircuitBreaker type
- Add a circuit breaker for compaction, with a policy that after 5
failures, compaction will not be attempted again for 24 hours.
- Add metrics that we can alert on: any >0 value for
`pageserver_circuit_breaker_broken_total` should generate an alert.
- Add a test that checks this works as intended.

A couple of notes for reviewers:
- Circuit breakers are intrinsically a defense-in-depth measure: this is not the solution to any underlying issue, just a general mitigation for "unknown unknowns" that might be encountered in the future.
- This PR isn't primarily about writing a perfect CircuitBreaker type: the one in this PR is meant to be just enough to mitigate issues in compaction and make it easy to monitor/alert on these failures. We can refine the type in the future as/when we want to use it elsewhere.
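
For illustration, the compaction loop wiring might look like the following, reusing the CircuitBreaker sketch from earlier in this issue (`run_compaction` is a stand-in, not the real pageserver function):

```rust
/// With this PR's policy, the breaker would be constructed as:
/// CircuitBreaker::new(5, Some(std::time::Duration::from_secs(24 * 60 * 60)))
fn compaction_iteration(breaker: &mut CircuitBreaker) {
    if breaker.is_broken() {
        // Skip compaction entirely (and avoid log spam) until the breaker
        // closes again, 24 hours after it broke.
        return;
    }
    match run_compaction() {
        Ok(()) => breaker.success(),
        Err(err) => {
            eprintln!("compaction failed: {err}");
            breaker.fail();
        }
    }
}

/// Stand-in for the real compaction work.
fn run_compaction() -> Result<(), std::io::Error> {
    Ok(())
}
```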
skyzh pushed a commit that referenced this issue Jul 15, 2024 (same commit message as the Jul 12 commit above)
jcsp added a commit that referenced this issue Jul 17, 2024
## Problem

We lack insight into:
- How much of a tenant's physical size is image vs. delta layers
- Average sizes of image vs. delta layers
- Total layer counts per timeline, indicating size of index_part object

Beyond general observability love, this is motivated by #6738, where we need to define some sensible thresholds for storage amplification. Using total physical size may not work well (if someone does a lot of DROPs, it's legitimate for the physical-to-synthetic ratio to be huge), but the ratio between image layer size and delta layer size may be a better indicator of whether we're generating unreasonable quantities of image layers.

## Summary of changes

- Add `pageserver_layer_bytes` and `pageserver_layer_count` metrics, labelled by timeline and `kind` (delta or image)
- Add & subtract these over LayerInner's lifetime.

I'm intentionally avoiding a generic metric RAII guard object, to avoid bloating LayerInner: it already has all the information it needs to update the metrics on new and drop.
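
As a sketch of what these metrics might look like if registered with the `prometheus` crate directly (the pageserver has its own metrics wrappers, so the exact label set here is an assumption based on this description):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

static LAYER_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_layer_bytes",
        "Sum of layer file sizes, by timeline and layer kind",
        &["timeline_id", "kind"]
    )
    .expect("failed to register pageserver_layer_bytes")
});

static LAYER_COUNT: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_layer_count",
        "Number of layer files, by timeline and layer kind",
        &["timeline_id", "kind"]
    )
    .expect("failed to register pageserver_layer_count")
});

/// Called when a layer file comes into existence; `kind` is "delta" or "image".
fn on_layer_created(timeline_id: &str, kind: &str, bytes: i64) {
    LAYER_BYTES.with_label_values(&[timeline_id, kind]).add(bytes);
    LAYER_COUNT.with_label_values(&[timeline_id, kind]).inc();
}

/// The mirror image, tied to the layer's drop (LayerInner's lifetime in the PR).
fn on_layer_dropped(timeline_id: &str, kind: &str, bytes: i64) {
    LAYER_BYTES.with_label_values(&[timeline_id, kind]).sub(bytes);
    LAYER_COUNT.with_label_values(&[timeline_id, kind]).dec();
}
```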
problame pushed a commit that referenced this issue Jul 22, 2024 (same commit message as the Jul 17 commit above)
jcsp closed this as completed Nov 14, 2024