Rate-limit new prebuilds to 50 / user / minute to prevent incidents #8504

Closed · jankeromnes wants to merge 2 commits into main from jx/rt-lmt-prblds

Conversation

@jankeromnes (Contributor) commented Mar 1, 2022

Description

Rate-limit new prebuilds to 50 / user / minute to prevent incidents
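
For illustration, a minimal sketch of what such a per-user limit on prebuild starts could look like. It assumes the rate-limiter-flexible library (the remainingPoints / msBeforeNext / consumedPoints fields in the warnings further down match its RateLimiterRes shape); the names startPrebuildLimiter and guardPrebuildStart are illustrative, not identifiers from this PR.

import { RateLimiterMemory, RateLimiterRes } from "rate-limiter-flexible";

// Illustrative limiter: 50 prebuild starts per user per 60-second window.
const startPrebuildLimiter = new RateLimiterMemory({
    keyPrefix: "startPrebuild",
    points: 50,   // allowed prebuild starts per window
    duration: 60, // window length in seconds
});

export async function guardPrebuildStart(userId: string): Promise<void> {
    try {
        await startPrebuildLimiter.consume(userId);
    } catch (rejected) {
        // rate-limiter-flexible rejects with a RateLimiterRes once the budget is spent
        const res = rejected as RateLimiterRes;
        console.warn("Rate limiter prevents accessing method due to too many requests.", {
            remainingPoints: res.remainingPoints,
            msBeforeNext: res.msBeforeNext,
        });
        throw rejected;
    }
}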

Related Issue(s)

Fixes #5176

How to test

  1. Fork https://gitlab.com/jankeromnes/test-gitpod-prebuilds-rate-limit
  2. Add your fork as a project here: https://jx-rt-lmt-prblds.staging.gitpod-dev.com/new?team=user
  3. Open your fork in Gitpod (any Gitpod) and run ./push-100.sh
  4. Once all 100 commits have been pushed (20 commits to 5 branches), verify that:
    • most prebuilds actually started (but not all 100)
    • some prebuilds never started but got cancelled instead (between 0 and 50, depending on when the burst happened in the 1min rate-limit window)
    • kubectl logs server-[TAB] server | grep -i "rate limit" contains some rate limit warnings (as many as the cancelled prebuilds)

Release Notes

Rate-limit new prebuilds to 50 / user / minute to prevent incidents

Documentation

@jankeromnes added the labels feature: prebuilds, operations: past incident, and team: webapp on Mar 1, 2022
@codecov (bot) commented Mar 1, 2022

Codecov Report

Merging #8504 (35e97c1) into main (6db0f39) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #8504   +/-   ##
=======================================
  Coverage   11.17%   11.17%           
=======================================
  Files          18       18           
  Lines         993      993           
=======================================
  Hits          111      111           
  Misses        880      880           
  Partials        2        2           
Flag Coverage Δ
components-gitpod-cli-app 11.17% <ø> (ø)

Flags with carried forward coverage won't be shown.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 362f860...35e97c1.

@jankeromnes (Contributor, Author) commented Mar 1, 2022

Re-roll 🎲

/werft run

👍 started the job as gitpod-build-jx-rt-lmt-prblds.1

@jankeromnes (Contributor, Author) commented Mar 1, 2022

Expected

  • 100 commits pushed (20 to 5 branches)
  • 50 prebuilds started (state = available)
  • 50 prebuilds cancelled (state = aborted)
  • 50 rate limit warnings

Current state

  • 100 commits pushed (20 to 5 branches)
  • 87 prebuilds started (state = available)
  • 13 prebuilds "queued"? 🤔 (state = queued)
  • 13 rate limit warnings:
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.082Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8499,"consumedPoints":51,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.085Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8495,"consumedPoints":52,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.095Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8485,"consumedPoints":53,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.098Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8483,"consumedPoints":54,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.127Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8453,"consumedPoints":55,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.147Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8434,"consumedPoints":56,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.169Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8411,"consumedPoints":57,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.185Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8395,"consumedPoints":58,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.194Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8386,"consumedPoints":59,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.196Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8385,"consumedPoints":60,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.201Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8379,"consumedPoints":61,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.210Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8371,"consumedPoints":62,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.245Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8335,"consumedPoints":63,"isFirstInDuration":false},{"method":"startPrebuild"}]}

Something seems off here 🤔

EDIT:

  • our 30-second prebuild burst likely happened right at the juncture of two 1-minute "periods", which would explain why not all 50 surplus prebuilds got rate-limited
  • rate-limited prebuilds should get state = 'aborted' instead of remaining in state = 'queued' (see the sketch below)
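
A possible shape for that second fix, purely as a sketch: the interfaces and the storePrebuiltWorkspace name below are hypothetical stand-ins, not the actual WorkspaceDB API.

import { RateLimiterRes } from "rate-limiter-flexible";

// Hypothetical shapes for illustration only; the real PrebuiltWorkspace type differs.
interface PrebuiltWorkspace { id: string; state: "queued" | "building" | "available" | "aborted"; error?: string; }
interface WorkspaceDB { storePrebuiltWorkspace(pws: PrebuiltWorkspace): Promise<PrebuiltWorkspace>; }

// When the limiter rejects a prebuild start, record it as aborted instead of leaving it queued.
async function markPrebuildRateLimited(db: WorkspaceDB, pws: PrebuiltWorkspace, rejected: RateLimiterRes): Promise<void> {
    pws.state = "aborted";
    pws.error = `Prebuild was rate-limited; the window resets in ~${Math.ceil(rejected.msBeforeNext / 1000)}s`;
    await db.storePrebuiltWorkspace(pws);
}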

@gtsiolis (Contributor) commented Mar 1, 2022

See also relevant discussion (internal) in case this is related. Cc @jankeromnes @JanKoehnlein @jldec

@jankeromnes (Contributor, Author) commented:

Ah, cool, I didn't know we could see webhook errors on GitLab. Thanks @gtsiolis!

Would be curious to see if we can get the actual error messages when a webhook call fails.

@gtsiolis (Contributor) commented Mar 1, 2022

> Would be curious to see if we can get the actual error messages when a webhook call fails.

@jankeromnes Yes, we can. Here are the two most recent errors seen in our fork.

  1. https://gitlab.com/gitpod-io/gitlab/-/hooks/10823912/hook_logs/2455346235 (internal)
  2. https://gitlab.com/gitpod-io/gitlab/-/hooks/10823912/hook_logs/2455045008 (internal)

@jankeromnes (Contributor, Author) commented Mar 1, 2022

Hey, thanks a lot for surfacing these. 💯🙏

Both seem to have failed with:

Internal error occurred while delivering this webhook.
Error: Net::ReadTimeout

I wonder if the timeout means Gitpod didn't respond in time, or if it's something inside GitLab that timed out. 🤔

In any case, this feature looks super useful for debugging webhook problems (especially now that we're inventing a new webhook problem: the rate limit 😅)

@jankeromnes (Contributor, Author) commented Mar 2, 2022

Two new errors on the current deployment, and rate-limiting doesn't work 🙄

TypeError: Cannot read properties of undefined (reading 'statusCode')
    at /app/node_modules/@gitbeaker/node/dist/index.js:168:43
    at step (/app/node_modules/@gitbeaker/node/dist/index.js:82:23)
    at Object.next (/app/node_modules/@gitbeaker/node/dist/index.js:63:53)
    at fulfilled (/app/node_modules/@gitbeaker/node/dist/index.js:53:58)
(node:29) UnhandledPromiseRejectionWarning: Error: NotFoundError
    at Object.<anonymous> (/app/node_modules/@gitpod/server/dist/src/errors/index.js:26:41)
    at Generator.next (<anonymous>)
    at /app/node_modules/@gitpod/server/dist/src/errors/index.js:13:71
    at new Promise (<anonymous>)
    at __awaiter (/app/node_modules/@gitpod/server/dist/src/errors/index.js:9:12)
    at Object.create (/app/node_modules/@gitpod/server/dist/src/errors/index.js:21:16)
    at GitlabContextParser.<anonymous> (/app/node_modules/@gitpod/server/dist/src/gitlab/gitlab-context-parser.js:187:52)
    at Generator.next (<anonymous>)
    at fulfilled (/app/node_modules/@gitpod/server/dist/src/gitlab/gitlab-context-parser.js:19:58)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

EDIT: I didn't understand where these errors were coming from, so I just hard-reset my PR to the last state where it still worked (un-rebased and undid my latest changes).

@jankeromnes (Contributor, Author) commented Mar 2, 2022

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-jx-rt-lmt-prblds.3

@@ -358,7 +358,7 @@ class GitpodJsonRpcProxyFactory<T extends object> extends JsonRpcProxyFactory<T>
         throw rlRejected;
     }
     log.warn({ userId }, "Rate limiter prevents accessing method due to too many requests.", rlRejected, { method });
-    throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.round(rlRejected.msBeforeNext / 1000) || 1 });
+    throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.ceil(rlRejected.msBeforeNext / 1000) || 1 });

Member commented:

👍
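
A quick illustration of the rounding difference, using msBeforeNext = 8499 from the first warning payload above (illustrative numbers only):

// retryAfter as computed before and after this change
const msBeforeNext = 8499;
const rounded = Math.round(msBeforeNext / 1000) || 1; // 8: a client retrying after 8s can still hit the closed window
const ceiled = Math.ceil(msBeforeNext / 1000) || 1;   // 9: waiting 9s guarantees the 60-second window has reset
// The trailing `|| 1` keeps retryAfter at a minimum of one second when the division yields 0.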

try {
await UserRateLimiter.instance(this.config.rateLimiter).consume(user.id, startPrebuild);
} catch (error) {
span.setTag("starting", false);

Member commented:

Prefixing with the current span name, e.g. startPrebuild.starting, is nice for discoverability in Honeycomb.
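
In other words, something along these lines (a sketch assuming the opentracing API; the span name and tag are illustrative):

import { globalTracer } from "opentracing";

// Tags prefixed with the operation name, e.g. "startPrebuild.starting", are easier
// to discover in Honeycomb than a bare "starting".
const span = globalTracer().startSpan("startPrebuild");
span.setTag("startPrebuild.starting", false);
span.finish();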

@geropl (Member) commented Mar 2, 2022

@jankeromnes Thx for tackling this long-standing yet very annoying issue. Two thoughts:

  • the current (draft) version throttles per user. With an additional method here we could turn it into per-repo (a per-repo variant is sketched after this list)
  • have you thought about using the DB as ground-truth about how many workspaces are already running per repo? 🤔 This would:
    • effectively share rate-limiting state across server instances: the current approach potentially allows 6 x 50 = 300 prebuilds per minute
    • pave the way to friendlier modes of handling over-traffic than "accept first, drop all others"
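
For the per-repo variant (sketched here under the same rate-limiter-flexible assumption as above), the only real change is which string is used as the limiter key; using cloneURL is my illustration, not necessarily what the linked method does.

import { RateLimiterMemory } from "rate-limiter-flexible";

// Per-repo sketch: budget of 50 prebuild starts per clone URL per minute.
const perRepoLimiter = new RateLimiterMemory({ keyPrefix: "startPrebuild", points: 50, duration: 60 });

async function guardPrebuildStartForRepo(cloneURL: string): Promise<void> {
    // Keying by cloneURL instead of user.id turns the per-user limit into a per-repo one.
    await perRepoLimiter.consume(cloneURL);
}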

@jankeromnes (Contributor, Author) commented Mar 2, 2022

Working again after un-rebase! 🙌

$ kubectl logs server-7577d78b89-5jsmr server | grep -i "rate limit" | wc -l
14
mysql> select state, count(id) prebuilds from d_b_prebuilt_workspace where creationTime > '2022-03-02 14:06' group by state order by prebuilds desc;
+-----------+-----------+
| state     | prebuilds |
+-----------+-----------+
| available |        86 |
| queued    |        14 |
+-----------+-----------+
2 rows in set (0.00 sec)

Now back to marking rate-limited prebuilds as "cancelled" (state = 'aborted') instead of leaving them "pending" (state = 'queued').


> Thx for tackling this long-standing yet very annoying issue.

@geropl You're quite welcome! 😊 Also super happy that we're moving forward with this.

> the current (draft) version throttles per user. With an additional method here we could turn it into per-repo

Aha 💡 thanks for the pointer. I think throttling per user makes sense for now, unless you happen to be the person who installed Gitpod for 3 or more legit super-high-frequency projects.

I'm a bit hesitant to make this PR more complicated / spend more time on it (because of higher priorities), but I agree that eventually we'll most likely want to rate-limit per repo / per project.

> have you thought about using the DB as ground-truth about how many workspaces are already running per repo? 🤔

I agree that this is desirable, but slightly more complicated. See also Kyle's comment in the issue: #5176 (comment) (maybe it's best to continue this discussion there)

> effectively share rate-limiting state across server instances: the current approach potentially allows 6 x 50 = 300 prebuilds per minute

Aha, right, I'm currently testing with a single server instance, which is why I didn't observe this.

Do you know whether repeated requests from the same caller (most likely the GitHub App) tend to reach the same server instance, or whether the selection is effectively random for every request? If it's purely random, we may want to fix the rate-limiter itself.

@jankeromnes (Contributor, Author) commented Mar 2, 2022

> effectively share rate-limiting state across server instances: the current approach potentially allows 6 x 50 = 300 prebuilds per minute

> Aha, right, I'm currently testing with a single server instance, which is why I didn't observe this.

I've tested this PR with 6 servers, using kubectl scale deployments/server --replicas=6.

mysql> select state, count(id) prebuilds from d_b_prebuilt_workspace where creationTime > '2022-03-02 14:42' group by state order by prebuilds desc;
+-----------+-----------+
| state     | prebuilds |
+-----------+-----------+
| available |       100 |
+-----------+-----------+
1 row in set (0.01 sec)
mysql> select state, count(id) prebuilds from d_b_prebuilt_workspace where creationTime > '2022-03-02 14:46' group by state order by prebuilds desc;
+-----------+-----------+
| state     | prebuilds |
+-----------+-----------+
| available |       100 |
+-----------+-----------+
1 row in set (0.01 sec)

Looks like 6 replicas defeat the rate-limiter. 😞

Maybe we need to make the rate-limiter track points across all replicas? What would be the cleanest solution?
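
One commonly used option (not what this PR implements) is to back the limiter with a store shared by all replicas, so every server consumes from the same budget. A sketch assuming rate-limiter-flexible's Redis-backed limiter and ioredis; the REDIS_URL environment variable is an assumption of mine.

import Redis from "ioredis";
import { RateLimiterRedis } from "rate-limiter-flexible";

// Shared-state sketch: with a single Redis budget, 6 replicas still allow only
// 50 prebuild starts per user per minute in total.
const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379"); // assumed shared Redis
const sharedLimiter = new RateLimiterRedis({
    storeClient: redis,
    keyPrefix: "startPrebuild",
    points: 50,
    duration: 60,
});

export async function guardPrebuildStartShared(userId: string): Promise<void> {
    await sharedLimiter.consume(userId); // rejects once the cluster-wide budget is spent
}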

@jankeromnes (Contributor, Author) commented Mar 3, 2022

While this approach (i.e. using the server rate-limiter inside the PrebuildManager) basically works, it has the problem of multiplying the allowed rate limit by the number of server instances in the cluster.

I still believe that we could fix the rate-limiter itself to use the DB in order to share a single state between all server instances of a cluster (thus fixing this approach, and also fixing every other rate-limited server method).

However, for the specific issue of prebuild spam, we've discussed another approach (i.e. not using the rate-limiter, but making the PrebuildManager check the DB for how many prebuilds were previously started on the same cloneURL), which seems even simpler and could immediately prevent this type of outage.
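
Roughly, that alternative could look like the following sketch; countRecentPrebuildsForCloneURL is a hypothetical query helper, not an existing WorkspaceDB method, and the threshold is illustrative.

// Hypothetical DB interface; the real WorkspaceDB query would differ.
interface PrebuildCountingDB {
    countRecentPrebuildsForCloneURL(cloneURL: string, since: Date): Promise<number>;
}

const MAX_PREBUILDS_PER_CLONE_URL_PER_MINUTE = 50; // illustrative threshold

// Using the database as the single source of truth makes the check hold
// no matter how many server replicas handle the incoming webhooks.
async function shouldRateLimitPrebuild(db: PrebuildCountingDB, cloneURL: string): Promise<boolean> {
    const oneMinuteAgo = new Date(Date.now() - 60_000);
    const recent = await db.countRecentPrebuildsForCloneURL(cloneURL, oneMinuteAgo);
    return recent >= MAX_PREBUILDS_PER_CLONE_URL_PER_MINUTE;
}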

Thus, closing this Pull Request for now. @andrew-farries and @easyCZ, if you do find anything useful in this PR, please feel free to recycle it as you see fit. 🙂 Many thanks for taking this over! I'm also happy to help with any questions here.

@jankeromnes jankeromnes closed this Mar 3, 2022
@jankeromnes jankeromnes deleted the jx/rt-lmt-prblds branch March 24, 2022 14:58
Labels: do-not-merge/work-in-progress, feature: prebuilds, operations: past incident, release-note, size/M, team: webapp