
pageserver: circuit breakers #6738

Closed
jcsp opened this issue Feb 13, 2024 · 1 comment
Assignees: jcsp
Labels: a/tech_debt (Area: related to tech debt), c/storage/pageserver (Component: storage: pageserver)

Comments

jcsp (Collaborator) commented Feb 13, 2024

Examples of recent issues in other services that show up as storage issues:

  • Logical replication writes causing >1000x storage amplification, exhausting the pageserver's disk.
  • Too-fast retries on non-attached tenants saturating our ability to write error logs, making the pageserver unresponsive.
  • Corrupt pages causing compactions on damaged tenants to retry forever, emitting an endless stream of logs.

While we of course fix these issues at the source, we can still make the pageserver more resilient.

Implementation:

  • A CircuitBreaker type with success() and fail() functions, a threshold for how many consecutive failures break the circuit, and a duration after which the circuit closes again after breaking (may be none, to stay open forever); see the sketch after this list.
  • A tenant config property to override the circuit breaker, so that if repairing a seriously damaged tenant requires the pageserver to let it keep operating for some period, we can disable the breaker.
  • Circuit breakers in key locations:
    • Track success/fail of getpage requests for a timeline; on more than N consecutive errors, break the circuit so that subsequent requests on that timeline's page_service do no work at all (including no logging and no TenantManager locks).
    • Track success/fail of compaction; on more than N consecutive errors (e.g. 5), break the circuit to stop subsequent compaction attempts for a long period (e.g. 1 hr).
    • Periodically check for pathological storage amplification; a single failure is enough to break the circuit, and breaking it stops WAL ingest.
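
A minimal sketch of such a breaker, in Rust (names and details are illustrative, not the type that eventually landed in the pageserver):

```rust
use std::time::{Duration, Instant};

/// Illustrative sketch: `success()` resets the failure count, `fail()`
/// increments it, and once `failure_threshold` consecutive failures are seen
/// the circuit breaks. It closes again after `reset_period`, or stays broken
/// forever if that is `None`.
pub struct CircuitBreaker {
    failure_threshold: usize,
    reset_period: Option<Duration>,
    consecutive_failures: usize,
    broken_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: usize, reset_period: Option<Duration>) -> Self {
        Self {
            failure_threshold,
            reset_period,
            consecutive_failures: 0,
            broken_at: None,
        }
    }

    pub fn success(&mut self) {
        self.consecutive_failures = 0;
        self.broken_at = None;
    }

    pub fn fail(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            // Record when the circuit first broke; keep the original instant
            // if it is already broken.
            self.broken_at.get_or_insert_with(Instant::now);
        }
    }

    /// True if callers should skip work right now.
    pub fn is_broken(&self) -> bool {
        match (self.broken_at, self.reset_period) {
            (None, _) => false,
            (Some(_), None) => true, // breaker never closes again
            (Some(at), Some(period)) => at.elapsed() < period,
        }
    }
}
```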
jcsp added the c/storage/pageserver (Component: storage: pageserver) and a/tech_debt (Area: related to tech debt) labels Feb 13, 2024
jcsp (Collaborator, Author) commented Feb 13, 2024

Starting-point PR that needs generalizing somewhat: #6734

The main gate for landing something like this is having the tenant config mechanism to skip the checks, so that if we need to permit a misbehaving tenant to keep ingesting in order to get it into a fixed state, we can; a sketch of such an override is below.
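
Building on the CircuitBreaker sketch above, the override could be as simple as a per-tenant flag consulted alongside the breaker; the names here are hypothetical, not actual pageserver config fields:

```rust
/// Hypothetical per-tenant override: when set, circuit breakers are bypassed
/// for this tenant so that a damaged tenant can keep ingesting while we
/// repair it.
#[derive(Default)]
pub struct TenantConfOverride {
    pub disable_circuit_breakers: bool,
}

impl CircuitBreaker {
    /// Like `is_broken()`, but honours the per-tenant override.
    pub fn is_broken_for(&self, conf: &TenantConfOverride) -> bool {
        !conf.disable_circuit_breakers && self.is_broken()
    }
}
```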

jcsp self-assigned this Apr 5, 2024
jcsp added a commit that referenced this issue Jul 12, 2024
## Problem

We already back off on compaction retries, but the impact of a failing compaction can be so great that backing off for up to 300s isn't enough. A failing compaction can consume a lot of I/O and CPU when generating image layers for large tenants, and potentially also leak disk space.

Compaction failures are extremely rare and almost always indicate a bug, frequently a bug that will not let compaction proceed until it is fixed.

Related: #6738

## Summary of changes

- Introduce a CircuitBreaker type
- Add a circuit breaker for compaction, with a policy that after 5
failures, compaction will not be attempted again for 24 hours.
- Add metrics that we can alert on: any >0 value for
`pageserver_circuit_breaker_broken_total` should generate an alert.
- Add a test that checks this works as intended.

A couple of notes for reviewers:
- Circuit breakers are intrinsically a defense-in-depth measure: this is not the solution to any underlying issue, just a general mitigation for "unknown unknowns" that might be encountered in the future.
- This PR isn't primarily about writing a perfect CircuitBreaker type: the one in this PR is meant to be just enough to mitigate issues in compaction and make it easy to monitor/alert on these failures. We can refine the type in the future as/when we want to use it elsewhere.
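
For illustration, the compaction loop wiring might look like the following, reusing the CircuitBreaker sketch from earlier in this issue (`run_compaction` is a stand-in, not the real pageserver function):

```rust
/// With this PR's policy, the breaker would be constructed as:
/// CircuitBreaker::new(5, Some(std::time::Duration::from_secs(24 * 60 * 60)))
fn compaction_iteration(breaker: &mut CircuitBreaker) {
    if breaker.is_broken() {
        // Skip compaction entirely (and avoid log spam) until the breaker
        // closes again, 24 hours after it broke.
        return;
    }
    match run_compaction() {
        Ok(()) => breaker.success(),
        Err(err) => {
            eprintln!("compaction failed: {err}");
            breaker.fail();
        }
    }
}

/// Stand-in for the real compaction work.
fn run_compaction() -> Result<(), std::io::Error> {
    Ok(())
}
```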
skyzh pushed a commit that referenced this issue Jul 15, 2024 (same commit message as the Jul 12 commit above)
jcsp added a commit that referenced this issue Jul 17, 2024
## Problem

We lack insight into:
- How much of a tenant's physical size is image vs. delta layers
- Average sizes of image vs. delta layers
- Total layer counts per timeline, indicating size of index_part object

Beyond general observability love, this is motivated by #6738, where we need to define some sensible thresholds for storage amplification. Using total physical size may not work well (if someone does a lot of DROPs, it's legitimate for the physical-to-synthetic ratio to be huge), but the ratio between image layer size and delta layer size may be a better indicator of whether we're generating unreasonable quantities of image layers.

## Summary of changes

- Add `pageserver_layer_bytes` and `pageserver_layer_count` metrics, labelled by timeline and `kind` (delta or image)
- Add & subtract these over LayerInner's lifetime.

I'm intentionally avoiding a generic metric RAII guard object, to avoid bloating LayerInner: it already has all the information it needs to update the metrics on new and drop.
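
As a sketch of what these metrics might look like if registered with the `prometheus` crate directly (the pageserver has its own metrics wrappers, so the exact label set here is an assumption based on this description):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

static LAYER_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_layer_bytes",
        "Sum of layer file sizes, by timeline and layer kind",
        &["timeline_id", "kind"]
    )
    .expect("failed to register pageserver_layer_bytes")
});

static LAYER_COUNT: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_layer_count",
        "Number of layer files, by timeline and layer kind",
        &["timeline_id", "kind"]
    )
    .expect("failed to register pageserver_layer_count")
});

/// Called when a layer file comes into existence; `kind` is "delta" or "image".
fn on_layer_created(timeline_id: &str, kind: &str, bytes: i64) {
    LAYER_BYTES.with_label_values(&[timeline_id, kind]).add(bytes);
    LAYER_COUNT.with_label_values(&[timeline_id, kind]).inc();
}

/// The mirror image, tied to the layer's drop (LayerInner's lifetime in the PR).
fn on_layer_dropped(timeline_id: &str, kind: &str, bytes: i64) {
    LAYER_BYTES.with_label_values(&[timeline_id, kind]).sub(bytes);
    LAYER_COUNT.with_label_values(&[timeline_id, kind]).dec();
}
```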
problame pushed a commit that referenced this issue Jul 22, 2024 (same commit message as the Jul 17 commit above)
jcsp closed this as completed Nov 14, 2024