pageserver: circuit breakers on pathological amplification, repeated compaction failures #6734
Problem
See ticket: #6738
Under some circumstances, a misbehaving postgres can create problems for the pageserver:
To avoid impact on other tenants on the pageserver, and to help find issues sooner, the pageserver should take action against tenants that are misbehaving in these ways.
Summary of changes
- `enforce_circuit_breakers` tenant config. Defaulting to false for the first deployment of this. Later we will set it to true by default, and it will only be set to false during an incident if we actively want a particular tenant to be permitted to violate limits.
- `CircuitBreaker` type for counting failures and disabling an operation after too many failures.
- `CircuitBreaker` for tenant compaction, with a policy that after 5 failures we'll give up trying to compact for an hour.

Why not use the `failsafe` crate? It uses a mutex internally, and when we use one of these circuit breakers on e.g. a page request path, that is too much overhead (though we use a mutex in the compaction case, as this is called infrequently).
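As a rough illustration of the compaction policy described above (trip after 5 failures, stay tripped for an hour), here is a minimal sketch of a failure-counting breaker. It is not the code from this PR: the field and method names (`fail_threshold`, `reset_period`, `fail`, `success`, `is_broken`) are hypothetical, and resetting the count on success is an assumption. The caller passes the current time in explicitly so the behaviour is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch of a failure-counting circuit breaker.
/// Names and reset semantics are assumptions, not the pageserver's real type.
pub struct CircuitBreaker {
    /// Number of failures after which the breaker trips.
    fail_threshold: usize,
    /// How long the breaker stays tripped before it resets.
    reset_period: Duration,
    /// Failures counted since the last reset.
    fail_count: usize,
    /// When the breaker tripped, if it is currently tripped.
    broken_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(fail_threshold: usize, reset_period: Duration) -> Self {
        Self {
            fail_threshold,
            reset_period,
            fail_count: 0,
            broken_at: None,
        }
    }

    /// Record a failure observed at `now`; trips the breaker once the
    /// threshold is reached.
    pub fn fail(&mut self, now: Instant) {
        self.fail_count += 1;
        if self.broken_at.is_none() && self.fail_count >= self.fail_threshold {
            self.broken_at = Some(now);
        }
    }

    /// Record a success. Assumption: a success clears the failure history.
    pub fn success(&mut self) {
        self.fail_count = 0;
        self.broken_at = None;
    }

    /// Returns true while the operation should be skipped. Once
    /// `reset_period` has elapsed since tripping, the breaker resets itself.
    pub fn is_broken(&mut self, now: Instant) -> bool {
        match self.broken_at {
            Some(at) if now.duration_since(at) >= self.reset_period => {
                self.fail_count = 0;
                self.broken_at = None;
                false
            }
            Some(_) => true,
            None => false,
        }
    }
}
```

With this shape, a compaction loop would construct `CircuitBreaker::new(5, Duration::from_secs(3600))`, check `is_broken(Instant::now())` before each attempt, and call `fail` or `success` afterwards. Because the methods take `&mut self`, shared use would need a wrapper such as `std::sync::Mutex`, which matches the note above that a mutex is acceptable for infrequent compaction but not for a hot path.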