ratelimits: Auto pause zombie clients #7763

kruti-s · 2024-10-22T21:42:58Z

Added a new key-value ratelimit FailedAuthorizationsForPausingPerDomainPerAccount which is incremented each time a client fails a validation.
- As long as capacity exists in the bucket, a successful validation attempt will reset the bucket back to full capacity.
- Upon exhausting bucket capacity, the RA will send a gRPC request to the SA to pause the account:identifier pair. Further validation attempts for that pair will be rejected by the WFE.
Added a new feature flag, AutomaticallyPauseZombieClients, which enables automatic pausing of zombie clients in the RA.
Added a new RA metric paused_pairs{"paused":[bool], "repaused":[bool], "grace":[bool]} to monitor use of this new functionality.
Updated ra_test.go initAuthorities to allow accessing the *ratelimits.RedisSource for checking that the new ratelimit functions as intended.
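
For readers new to the mechanism, here is a minimal sketch of the bucket behavior described above. It assumes a plain in-memory counter with no refill over time, not the Redis-backed source the PR actually uses; the `bucket` type and the `recordFailure` and `recordSuccess` names are hypothetical and do not appear in the change.

```go
package main

import "fmt"

// bucket is a hypothetical in-memory stand-in for one account:identifier
// bucket. It ignores refill over time entirely.
type bucket struct {
	capacity  int
	remaining int
	paused    bool
}

func newBucket(capacity int) *bucket {
	return &bucket{capacity: capacity, remaining: capacity}
}

// recordFailure spends one token and reports whether this failure exhausted
// the bucket, i.e. whether the pair should be paused now.
func (b *bucket) recordFailure() bool {
	if b.paused {
		return false
	}
	if b.remaining > 0 {
		b.remaining--
	}
	if b.remaining == 0 {
		b.paused = true
		return true
	}
	return false
}

// recordSuccess resets the bucket to full capacity, but only while some
// capacity still exists; an exhausted (paused) bucket stays exhausted.
func (b *bucket) recordSuccess() {
	if !b.paused && b.remaining > 0 {
		b.remaining = b.capacity
	}
}

func main() {
	b := newBucket(3)
	b.recordFailure()
	b.recordFailure()
	b.recordSuccess() // a success before exhaustion refills the bucket
	fmt.Println("remaining after success:", b.remaining)

	for i := 0; i < 3; i++ {
		if b.recordFailure() {
			fmt.Println("bucket exhausted; pair would be paused")
		}
	}
}
```

The point of the sketch is the asymmetry: a success only helps while tokens remain, so once the pair is paused it stays paused until something outside this mechanism unpauses it.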

Co-authored-by: @pgporada

aarongable

Generally looking good! Most of my comments are quite minor.

I do have one big overall question though: What do we expect to set the burst/count/period to in prod?

Fundamentally, this rate limit does not have the ability to measure the same thing that we measured by hand when doing the first rounds of manual pausing. It cannot measure "has this domain+acct been failing for X months". It can only measure "has this domain+acct failed X times in a row".

In order for the "reset when they succeed" mechanism to make sense, it seems like we need to have a very long period and a very high count. For example, if we were hoping to catch folks who fail 4 times per day, every day for six months, we could set the Period at 180d and the Burst and Count both at 720. But that also means that someone failing at the maximum rate allowed by the existing FailedAuthorizations limit (5 times per hour) would get paused after just 6 days. So what values do we intend to set to try to strike the right balance here?

ra/ra.go

ratelimits/names.go

aarongable · 2024-10-23T18:06:27Z

Okay, I found the discussion of burst/count/period in #7738, and I chatted with Samantha, and now I have a concrete proposal for how this limit should be configured:

- As detailed in that bug, Burst should be 3600, reflecting our threshold of "40 failures per day across 90 days".
- The Period should be 1 day.
- The Count should be 1, so that people essentially get one "free" failure per day.

This maybe seems wild, but I think the math works out:

- If someone fails the fastest they can possibly fail (5 times per hour, or 120 times per day, as enforced by the existing FailedAuthorizations limit), then they'll burn through the burst in 30 days, and burn through those 30 extra tokens in a few hours. 30 days is the fastest I'd want to pause anyone, I think.
- If someone fails 40 times per day (i.e. right at our intended threshold), they'll get paused after ~92 days: 90 days to burn through their burst, and then a little bit more time to burn through the 90 tokens they've accumulated during that time.
- If someone fails 30 times per day, just below our target threshold, they'll get paused after about 125 days, which makes sense to me.
- If someone fails 4 times per day, they'll take 900 days to burn through their initial burst, during which time they'll earn 900 more tokens, which they'll burn through in 225 days, during which time... *does some calculus* ...they'll end up being paused after about 1200 days, or just over 3 years.
- If someone fails once per day, they'll never be paused.

Those last two bullet points make it seem like we could set the count even lower (or equivalently, set the period higher), but I think "one freebie per day" is a reasonable place to start.

I think this math is highly unintuitive, and should be documented in the code near where the limit is defined.
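
One way to write the math down, as a sketch rather than a description of the limiter's internals: assume tokens refill continuously at Count/Period, so the figures are approximations. The symbols B, r, f and T are introduced here for illustration only.

```latex
% B = Burst (tokens), r = Count / Period (tokens refilled per day),
% f = failures per day, T = days until the bucket is empty.
% The bucket empties when tokens spent equal the burst plus tokens refilled:
B + rT = fT \quad\Longrightarrow\quad T = \frac{B}{f - r}, \qquad f > r

% With B = 3600 and r = 1:
T(120) = \frac{3600}{119} \approx 30.25, \quad
T(40) = \frac{3600}{39} \approx 92.31, \quad
T(30) = \frac{3600}{29} \approx 124.14, \quad
T(4) = \frac{3600}{3} = 1200

% The step-by-step reasoning for f = 4 is a geometric series:
900 + 225 + 56.25 + \dots = \frac{900}{1 - \tfrac{1}{4}} = 1200
```

When f is at or below r, the bucket refills at least as fast as it drains and never empties, which is the "fails once per day, never paused" case.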

beautifulentropy · 2024-10-23T21:09:35Z

I think this is perfectly reasonable. Here's a small table that expands on these estimates:

| Failures/Day | Time to Pause |
| --- | --- |
| 1/day | (Never paused) |
| 2/day | 3600.00 days (~118.27 months, ~9.86 years) |
| 3/day | 1800.00 days (~59.13 months, ~4.93 years) |
| 4/day | 1200.00 days (~39.42 months, ~3.29 years) |
| 5/day | 900.00 days (~29.57 months, ~2.46 years) |
| 6/day | 720.00 days (~23.65 months, ~1.97 years) |
| 7/day | 600.00 days (~19.71 months, ~1.64 years) |
| 8/day | 514.29 days (~16.90 months, ~1.41 years) |
| 9/day | 450.00 days (~14.78 months, ~1.23 years) |
| 10/day | 400.00 days (~13.14 months, ~1.10 years) |
| 15/day | 257.14 days (~8.45 months, ~0.70 years) |
| 20/day | 189.47 days (~6.22 months, ~0.52 years) |
| 30/day | 124.14 days (~4.08 months, ~0.34 years) |
| 40/day | 92.31 days (~3.03 months, ~0.25 years) |
| 120/day | 30.25 days (~0.99 months, ~0.08 years) |
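
The table appears consistent with burst / (failures per day - refill per day). A short Go sketch that reproduces a few rows under that assumption follows; `daysToPause` is a hypothetical helper, not part of the ratelimits package, and the 30.44 days/month and 365.25 days/year divisors are guesses that happen to match the rounded figures.

```go
package main

import "fmt"

// daysToPause estimates how many days a client failing at a steady rate takes
// to drain a bucket of the given burst that refills at refillPerDay.
// It returns ok=false when the bucket never drains.
// Assumption: refill is treated as continuous, which is an approximation.
func daysToPause(burst, refillPerDay, failuresPerDay float64) (days float64, ok bool) {
	if failuresPerDay <= refillPerDay {
		return 0, false
	}
	return burst / (failuresPerDay - refillPerDay), true
}

func main() {
	// Burst 3600, Count 1, Period 1 day, as proposed in this thread.
	const burst, refillPerDay = 3600.0, 1.0
	for _, f := range []float64{1, 2, 4, 30, 40, 120} {
		days, ok := daysToPause(burst, refillPerDay, f)
		if !ok {
			fmt.Printf("%3.0f/day  never paused\n", f)
			continue
		}
		fmt.Printf("%3.0f/day  %.2f days (~%.2f months, ~%.2f years)\n",
			f, days, days/30.44, days/365.25)
	}
}
```

Changing `burst` or `refillPerDay` in `main` gives a quick way to compare alternative configurations before committing to one.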

kruti-s · 2024-10-30T18:42:32Z

@aarongable The tests are not running; the issue is at limiter.go line 163. I also haven't finished writing the TestResetAccountPausingLimit test.

pgporada · 2024-11-06T21:53:27Z

~~@letsencrypt/boulder-developers I am not sure why tests are failing, despite reading the error logs for each. I can't reproduce it locally.~~

pgporada · 2024-11-07T14:03:07Z

I figured it out. I merged main again and found the two tests that were failing in GitHub Actions CI but didn't exist on my local branch.

pgporada · 2024-11-07T15:11:11Z

CPS Compliance Review:

CPS 4.1.1

> Issuance depends on proper validation and compliance with ISRG policies.

As far as I can tell, this is the only document about our in-use ratelimits we have, but I don't think that's a formal policy.

CPS 4.2.2

> The CA server rejects issuance requests for DNS identifiers that do not have a Public Suffix in the ICANN domains section.

We also reject based on ratelimits.

ra/ra.go

pgporada · 2024-11-08T14:53:11Z

Should we increment a metric each time an account is paused?

beautifulentropy

This is looking really clean. I've just got a few comments, mostly around log line formatting and some code that I think can go away.

features/features.go

ra/ra.go

ratelimits/limiter.go

ra/ra.go

ra/ra_test.go

jsha · 2024-11-08T21:26:22Z

features/features.go

+	// AutomaticallyPauseZombieClients configures the RA to automatically track
+	// limiter to be the authoritative source of rate limiting information for
+	// automatically pausing clients who systemically fail every validation
+	// attempt. When disabled, only manually paused accountID:identifier pairs


I can't parse this sentence. Specifically "to automatically track limiter to be the authoritative source ..." seems to be an editing error?

Weird, this looks like some kind of hybrid of what was there and what I put in my suggestion: #7763 (comment)

jsha · 2024-11-08T21:31:39Z

ra/ra.go

 // countFailedValidation increments the failed authorizations per domain per
-// account rate limit. There is no reason to surface errors from this function
-// to the Subscriber, spends against this limit are best effort.
-func (ra *RegistrationAuthorityImpl) countFailedValidation(ctx context.Context, regId int64, name string) {
+// account rate limit. If the AutomaticallyPauseZombieClients feature has been
+// enabled, it also increments the failed authorizations for pausing per domain
+// per account rate limit. There is no reason to surface errors from this


// countFailedValidation increments the FailedAuthorizationsPerDomainPerAccount rate limit. If the AutomaticallyPauseZombieClients feature has been enabled, it also increments the FailedAuthorizationsForPausingPerDomainPerAccount rate limit.

jsha · 2024-11-08T21:44:50Z

ra/ra.go

+	if features.Get().AutomaticallyPauseZombieClients {
+		txn, err = ra.txnBuilder.FailedAuthorizationsForPausingPerDomainPerAccountTransaction(regId, ident.Value)
+		if err != nil {
+			ra.log.Warningf("building rate limit transaction for the %s rate limit: %s", ratelimits.FailedAuthorizationsForPausingPerDomainPerAccount, err)


I realize this is a copy of some existing code, but I think both line 1833 and this line should instead return an error. I know this function tries to avoid returning errors to the caller, but a failure to build the rate limit transaction represents some sort of internal logic error, and that should become a 500 (which helps ensure it shows up in our metrics, and gets logged in the WFE with some useful context).

Also, now that there are two places within this function where we look for and discard Canceled / DeadlineExceeded errors, it makes more sense to return those to the caller as well, and have the caller look for Canceled / DeadlineExceeded (so we won't have to duplicate logic as much).

jsha · 2024-11-08T21:47:24Z

ra/ra.go

+				},
+			})
+			if err != nil {
+				ra.log.Warningf("failed to pause %d/%q: %s", regId, ident.Value, err)


Here's another place where we should simply return the error, and let the caller filter out Canceled / DeadlineExceeded if it wants to.

jsha · 2024-11-08T21:48:51Z

ra/ra.go

@@ -241,6 +242,12 @@ func NewRegistrationAuthorityImpl(
 	})
 	stats.MustRegister(certCSRMismatch)

+	pauseCounter := prometheus.NewCounterVec(prometheus.CounterOpts{
+		Name: "paused_pairs",
+		Help: "Number of times a pause operation is performed, labeled by paused=[bool], repaused=[bool], grace=[bool]",


Perhaps it's too much for the help string here, but it would be good to document what repaused and grace mean, perhaps in a comment either here or where they are incremented.

Alternatively, we could remove paused=[bool], repaused=[bool], grace=[bool] from the help string, since that information is available directly from Prometheus, and use some of the saved space to explain the two less obvious labels.

Return an error and do logging in the caller. This adds early returns on a number of error conditions, which can prevent nil pointer dereference in those cases. Also update the description for AutomaticallyPauseZombieClients. Follows up #7763.

kruti-s added 7 commits October 15, 2024 17:07

edited names, bucket, overrides, etc

d122f18

more changes made

e966a9a

change error message for limiter

d6fed53

edit limiter_test for err msg

fb3a966

Increment ratelimit for IssuancePaused in ra.go

bdc2a13

SpendorCheck to SpendAndCheck

5bd14ab

Merge remote-tracking branch 'origin/main' into 7738-auto-pause-zombi…

2a02e4d

…e-clients

kruti-s requested a review from a team as a code owner October 22, 2024 21:42

kruti-s requested a review from aarongable October 22, 2024 21:42

aarongable reviewed Oct 23, 2024

kruti-s added 4 commits October 24, 2024 16:13

addressed comments for ra.go

03c3e71

Merge remote-tracking branch 'refs/remotes/origin/main' into 7738-aut…

73712f0

…o-pause-zombie-clients

change IssuancePausedPerDomainPerAccount to FailedAuthorizationsForPa…

d860f88

…usingPerDomainPerAccount

change count and period for FailedAuthorizationsForPausingPerDomainPe…

542b7c5

…rAccount

aarongable reviewed Oct 24, 2024

kruti-s added 2 commits October 25, 2024 14:49

Merge remote-tracking branch 'refs/remotes/origin/main' into 7738-aut…

7857899

…o-pause-zombie-clients

override and default explanation changes

35a6f63

pgporada reviewed Oct 28, 2024

kruti-s added 2 commits October 29, 2024 12:00

change working override test vals

6c4e2fe

Merge remote-tracking branch 'refs/remotes/origin/main' into 7738-aut…

97aebfc

…o-pause-zombie-clients

kruti-s force-pushed the 7738-auto-pause-zombie-clients branch from 3687ef5 to 97aebfc Compare October 30, 2024 18:33

kruti-s and others added 6 commits October 30, 2024 16:25

limiter test fixes

061f498

Fix test

a5ce802

started writing TesetResetAccountPausingLimit in ra_test.go

8a28d9c

Progress

a207b74

More progress

f983373

Progress is progressing

49b959f

Merge branch 'main' into 7738-auto-pause-zombie-clients

a79d881

pgporada added 2 commits November 7, 2024 09:03

Fix tests from merging main

cc01553

Cleanup comments

fb547cd

pgporada requested a review from a team November 7, 2024 14:47

aarongable previously approved these changes Nov 8, 2024

jprenken previously approved these changes Nov 8, 2024

Address comment

30ea374

pgporada dismissed stale reviews from jprenken and aarongable via 30ea374 November 8, 2024 14:48

pgporada requested review from pgporada, aarongable and jprenken November 8, 2024 14:48

beautifulentropy requested changes Nov 8, 2024

Address comments

2e65dc3

pgporada requested a review from beautifulentropy November 8, 2024 19:30

beautifulentropy requested changes Nov 8, 2024

Address next round of comments

4c7be8c

pgporada requested a review from beautifulentropy November 8, 2024 20:33

beautifulentropy approved these changes Nov 8, 2024

aarongable approved these changes Nov 8, 2024

jprenken approved these changes Nov 8, 2024

jprenken merged commit a79a830 into main Nov 8, 2024
14 checks passed

jprenken deleted the 7738-auto-pause-zombie-clients branch November 8, 2024 21:51

jsha reviewed Nov 8, 2024

jsha mentioned this pull request Nov 9, 2024

ra: clean up countFailedValidations #7797

Merged
