Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: circuit breakers on pathological amplification, repeated compaction failures #6734

Closed
wants to merge 4 commits into from

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Feb 12, 2024

Problem

See ticket: #6738

Under some circumstances, a misbehaving postgres can create problems for the pageserver:

  • Writing corrupt data can cause pageserver to emit endless warnings when trying to compact
  • Writing WAL contents that generate pathologically large physical state.

To avoid impact on other tenants on the pageserver, and to help find issues sooner, the pageserver should take action against tenants that are misbehaving in these ways.

Summary of changes

  • Add enforce_circuit_breakers tenant config. Defaulting to false for the first deployment of this. Later we will set it to true by default, and it will only be set to false during an incident if we actively want a particular tenant to be permitted to violate limits.
  • During tenant compaction, check storage amplification and halt WAL ingress if it violates a threshold (1000x)
  • Add a CircuitBreaker type for counting failures and disabling an operation after too many failures
  • Use a CircuitBreaker for tenant compaction, with a policy that after 5 failures we'll give up trying to compact for an hour.

Why not use the failsafe crate? -- it uses a mutex internally, and when we use one of these circuit breakers on e.g. a page request path, that is too much overhead (though we use a mutex in the compaction case, as this is called infrequently)

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver t/on_call_followup labels Feb 12, 2024
Copy link

github-actions bot commented Feb 12, 2024

2436 tests run: 2318 passed, 0 failed, 118 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_crafted_wal_end[last_wal_record_crossing_segment]: release

Postgres 14

  • test_sharding_split_unsharded: release

Code coverage (full report)

  • functions: 56.1% (12824 of 22852 functions)
  • lines: 82.6% (69246 of 83835 lines)

The comment gets automatically updated with the latest test results
2183290 at 2024-02-13T15:04:02.949Z :recycle:

@jcsp jcsp changed the title pageserver: shut down WAL ingest if pathological storage amplificatio… pageserver: circuit breakers on pathological amplification, repeated compaction failures Feb 13, 2024
@jcsp jcsp force-pushed the jcsp/pageserver-self-defense branch from 49e9b35 to 2183290 Compare February 13, 2024 14:22
@jcsp
Copy link
Collaborator Author

jcsp commented Jul 17, 2024

The compaction failure circuit breaker was added in another PR. The pathological amplification check in this PR is wrong, it would fire on tenants that do a DROP on a big DB because physical size lags synthetic size.

@jcsp jcsp closed this Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug t/on_call_followup
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant