ratelimits: Auto pause zombie clients #7763

Merged Nov 8, 2024 (35 commits)

kruti-s (Contributor) commented Oct 22, 2024

  • Added a new key-value rate limit, FailedAuthorizationsForPausingPerDomainPerAccount, which is incremented each time a client fails a validation.
    • As long as capacity remains in the bucket, a successful validation attempt resets the bucket to full capacity.
    • Once the bucket's capacity is exhausted, the RA sends a gRPC request to the SA to pause the account:identifier pair. The WFE rejects further validation attempts for that pair.
  • Added a new feature flag, AutomaticallyPauseZombieClients, which enables automatic pausing of zombie clients in the RA.
  • Added a new RA metric, paused_pairs{"paused":[bool], "repaused":[bool], "grace":[bool]}, to monitor use of this new functionality.
  • Updated initAuthorities in ra_test.go to expose the *ratelimits.RedisSource, so tests can check that the new rate limit functions as intended.
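
The spend/reset/pause behaviour described in the first bullet can be sketched as a tiny in-memory state machine. This is only an illustration, not the code in this PR: the `pauseBucket` type and its methods are hypothetical names, time-based refill is left out, and the real limit lives in the ratelimits package backed by Redis.

```go
package main

import "fmt"

// pauseBucket is a hypothetical stand-in for one account:identifier bucket.
// Time-based refill is deliberately omitted to keep the sketch small.
type pauseBucket struct {
	capacity  int
	remaining int
	paused    bool
}

func newPauseBucket(capacity int) *pauseBucket {
	return &pauseBucket{capacity: capacity, remaining: capacity}
}

// recordFailure spends one token. It returns true only for the failure that
// exhausts the bucket, which is the point where a pause would be requested.
func (b *pauseBucket) recordFailure() bool {
	if b.paused {
		return false
	}
	if b.remaining > 0 {
		b.remaining--
	}
	if b.remaining == 0 {
		b.paused = true
		return true
	}
	return false
}

// recordSuccess resets the bucket to full capacity, but only while some
// capacity remains. An exhausted (paused) bucket is left untouched.
func (b *pauseBucket) recordSuccess() {
	if !b.paused && b.remaining > 0 {
		b.remaining = b.capacity
	}
}

func main() {
	b := newPauseBucket(3)
	b.recordFailure()
	b.recordFailure()
	b.recordSuccess()
	fmt.Println("remaining after success:", b.remaining)

	for i := 0; i < 3; i++ {
		if b.recordFailure() {
			fmt.Println("bucket exhausted, pair would be paused")
		}
	}
}
```

In this sketch a success after exhaustion does nothing, mirroring the statement that further validation attempts are rejected once the pair is paused.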

Co-authored-by: @pgporada

Fixes #7738

@kruti-s kruti-s requested a review from a team as a code owner October 22, 2024 21:42
aarongable (Contributor) left a comment

Generally looking good! Most of my comments are quite minor.

I do have one big overall question though: What do we expect to set the burst/count/period to in prod?

Fundamentally, this rate limit does not have the ability to measure the same thing that we measured by hand when doing the first rounds of manual pausing. It cannot measure "has this domain+acct been failing for X months". It can only measure "has this domain+acct failed X times in a row".

In order for the "reset when they succeed" mechanism to make sense, it seems like we need to have a very long period and a very high count. For example, if we were hoping to catch folks who fail 4 times per day, every day for six months, we could set the Period at 180d and the Burst and Count both at 720. But that also means that someone failing at the maximum rate allowed by the existing FailedAuthorizations limit (5 times per hour) would get paused after just 6 days. So what values do we intend to set to try to strike the right balance here?

aarongable (Contributor) commented Oct 23, 2024

Okay, I found the discussion of burst/count/period in #7738, and I chatted with Samantha, and now I have a concrete proposal for how this limit should be configured:

  • As detailed in that bug, Burst should be 3600, reflecting our threshold of "40 failures per day across 90 days"
  • The Period should be 1 day
  • The Count should be 1, so that people essentially get one "free" failure per day

This maybe seems wild, but I think the math works out:

  • If someone fails the fastest they can possibly fail (5 times per hour, or 120 times per day, as enforced by the existing FailedAuthorizations limit), then they'll burn through the burst in 30 days, and burn through those 30 extra tokens in a few hours. 30 days is the fastest I'd want to pause anyone, I think.
  • If someone fails 40 times per day (i.e. right at our intended threshold), they'll get paused after ~92 days: 90 days to burn through their burst, and then a little bit more time to burn through the 90 tokens they've accumulated during that time.
  • If someone fails 30 times per day, just below our target threshold, they'll get paused after about 124 days, which makes sense to me.
  • If someone fails 4 times per day, they'll take 900 days to burn through their initial burst, during which time they'll earn 900 more tokens, which they'll burn through in 225 days, during which time... (does some calculus) ...they'll end up being paused after about 1200 days, or just over 3 years.
  • If someone fails once per day, they'll never be paused.

Those last two bullet points make it seem like we could set the count even lower (or equivalently, set the period higher), but I think "one freebie per day" is a reasonable place to start.

I think this math is highly unintuitive, and should be documented in the code near where the limit is defined.
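
One way to write out the "does some calculus" step is as a geometric series. This is a sketch that assumes tokens refill continuously at Count per Period; the symbols B (Burst), r (tokens refilled per day), f (failures per day), and T (days until pause) are introduced here for illustration only.

$$
T = \frac{B}{f}\sum_{k=0}^{\infty}\left(\frac{r}{f}\right)^{k} = \frac{B}{f}\cdot\frac{1}{1 - r/f} = \frac{B}{f - r}, \qquad f > r
$$

With B = 3600, r = 1, and f = 4, the terms are 900 + 225 + 56.25 + ..., which sums to 3600 / 3 = 1200 days. When f ≤ r the bucket never empties, which is the "fails once per day" case.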

beautifulentropy (Member) commented Oct 23, 2024

I think this is perfectly reasonable. Here's a small table that expands on these estimates:

| Failures/Day | Time to Pause |
|---|---|
| 1/day | Never paused |
| 2/day | 3600.00 days (~118.27 months, ~9.86 years) |
| 3/day | 1800.00 days (~59.13 months, ~4.93 years) |
| 4/day | 1200.00 days (~39.42 months, ~3.29 years) |
| 5/day | 900.00 days (~29.57 months, ~2.46 years) |
| 6/day | 720.00 days (~23.65 months, ~1.97 years) |
| 7/day | 600.00 days (~19.71 months, ~1.64 years) |
| 8/day | 514.29 days (~16.90 months, ~1.41 years) |
| 9/day | 450.00 days (~14.78 months, ~1.23 years) |
| 10/day | 400.00 days (~13.14 months, ~1.10 years) |
| 15/day | 257.14 days (~8.45 months, ~0.70 years) |
| 20/day | 189.47 days (~6.22 months, ~0.52 years) |
| 30/day | 124.14 days (~4.08 months, ~0.34 years) |
| 40/day | 92.31 days (~3.03 months, ~0.25 years) |
| 120/day | 30.25 days (~0.99 months, ~0.08 years) |

@kruti-s kruti-s force-pushed the 7738-auto-pause-zombie-clients branch from 3687ef5 to 97aebfc Compare October 30, 2024 18:33
kruti-s (Contributor, Author) commented Oct 30, 2024

@aarongable The tests are not running; there's an issue at limiter.go line 163. I also haven't finished writing the TestResetAccountPausingLimit test.

pgporada (Member) commented Nov 6, 2024

@letsencrypt/boulder-developers I am not sure why tests are failing, despite reading the error logs for each. I can't reproduce it locally.

pgporada (Member) commented Nov 7, 2024

I figured it out. I merged main again and found that the two tests failing in GitHub Actions CI didn't exist on my local branch.

@pgporada pgporada requested a review from a team November 7, 2024 14:47
pgporada (Member) commented Nov 7, 2024

CPS Compliance Review:

  • CPS 4.1.1

    Issuance depends on proper validation and compliance with ISRG policies.

    As far as I can tell, this is the only document about our in-use ratelimits we have, but I don't think that's a formal policy.

  • CPS 4.2.2

    The CA server rejects issuance requests for DNS identifiers that do not have a Public Suffix in the ICANN domains section.

    We also reject based on ratelimits.

aarongable previously approved these changes Nov 8, 2024
jprenken previously approved these changes Nov 8, 2024
pgporada (Member) commented Nov 8, 2024

Should we increment a metric each time an account is paused?

beautifulentropy (Member) left a comment

This is looking really clean. I've just got a few comments, mostly around log line formatting and some code that I think can go away.

@jprenken jprenken merged commit a79a830 into main Nov 8, 2024
14 checks passed
@jprenken jprenken deleted the 7738-auto-pause-zombie-clients branch November 8, 2024 21:51
Comment on lines +122 to +125
// AutomaticallyPauseZombieClients configures the RA to automatically track
// limiter to be the authoritative source of rate limiting information for
// automatically pausing clients who systemically fail every validation
// attempt. When disabled, only manually paused accountID:identifier pairs

Contributor commented:

I can't parse this sentence. Specifically "to automatically track limiter to be the authoritative source ..." seems to be an editing error?

Member replied:

Weird, this looks like some kind of hybrid of what was there and what I put in my suggestion: #7763 (comment)

Comment on lines 1820 to +1823
// countFailedValidation increments the failed authorizations per domain per
// account rate limit. There is no reason to surface errors from this function
// to the Subscriber, spends against this limit are best effort.
func (ra *RegistrationAuthorityImpl) countFailedValidation(ctx context.Context, regId int64, name string) {
// account rate limit. If the AutomaticallyPauseZombieClients feature has been
// enabled, it also increments the failed authorizations for pausing per domain
// per account rate limit. There is no reason to surface errors from this

Contributor commented:

// countFailedValidation increments the FailedAuthorizationsPerDomainPerAccount. If the AutomaticallyPauseZombieClients feature has been enabled, it also increments the FailedAuthorizationsForPausingPerDomainPerAccountTransaction rate limit

if features.Get().AutomaticallyPauseZombieClients {
	txn, err = ra.txnBuilder.FailedAuthorizationsForPausingPerDomainPerAccountTransaction(regId, ident.Value)
	if err != nil {
		ra.log.Warningf("building rate limit transaction for the %s rate limit: %s", ratelimits.FailedAuthorizationsForPausingPerDomainPerAccount, err)

Contributor commented:

I realize this is a copy of some existing code, but I think both line 1833 and this line should instead return an error. I know this function tries to avoid returning errors to the caller, but a failure to build the rate limit transaction represents some sort of internal logic error, and that should become a 500 (which helps ensure it shows up in our metrics, and gets logged in the WFE with some useful context).

Also, now that there are two places within this function where we look for and discard Canceled / DeadlineExceeded errors, it makes more sense to return those to the caller as well, and have the caller look for Canceled / DeadlineExceeded (so we won't have to duplicate logic as much).

		},
	})
	if err != nil {
		ra.log.Warningf("failed to pause %d/%q: %s", regId, ident.Value, err)

Contributor commented:

Here's another place where we should simply return the error, and let the caller filter out Canceled / DeadlineExceeded if it wants to.

@@ -241,6 +242,12 @@ func NewRegistrationAuthorityImpl(
	})
	stats.MustRegister(certCSRMismatch)

	pauseCounter := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "paused_pairs",
		Help: "Number of times a pause operation is performed, labeled by paused=[bool], repaused=[bool], grace=[bool]",

Contributor commented:

Perhaps it's too much for the help string here, but it would be good to document what repaused and grace mean. Perhaps in a comment either here or where they are incremented?

Alternatively, we could remove paused=[bool], repaused=[bool], grace=[bool] from the help string, since that information is available directly from Prometheus, and use some of the saved space to explain the two less obvious labels.

beautifulentropy pushed a commit that referenced this pull request Nov 14, 2024
Return an error and do logging in the caller. This adds early returns on
a number of error conditions, which can prevent nil pointer dereference
in those cases.

Also update the description for AutomaticallyPauseZombieClients.

Follows up #7763.
May close issue: Automatically Pause Zombie Clients
6 participants