Rate-limit new prebuilds to 50 / user / minute to prevent incidents #8504

Closed · jankeromnes wants to merge 2 commits into main from jx/rt-lmt-prblds

Conversation

@jankeromnes (Contributor) commented Mar 1, 2022

Description

Rate-limit new prebuilds to 50 / user / minute to prevent incidents
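
For illustration, a minimal sketch of what such a per-user limit on prebuild starts could look like. It assumes the rate-limiter-flexible library (the remainingPoints / msBeforeNext / consumedPoints fields in the warnings further down match its RateLimiterRes shape); the names startPrebuildLimiter and guardPrebuildStart are illustrative, not identifiers from this PR.

import { RateLimiterMemory, RateLimiterRes } from "rate-limiter-flexible";

// Illustrative limiter: 50 prebuild starts per user per 60-second window.
const startPrebuildLimiter = new RateLimiterMemory({
    keyPrefix: "startPrebuild",
    points: 50,   // allowed prebuild starts per window
    duration: 60, // window length in seconds
});

export async function guardPrebuildStart(userId: string): Promise<void> {
    try {
        await startPrebuildLimiter.consume(userId);
    } catch (rejected) {
        // rate-limiter-flexible rejects with a RateLimiterRes once the budget is spent
        const res = rejected as RateLimiterRes;
        console.warn("Rate limiter prevents accessing method due to too many requests.", {
            remainingPoints: res.remainingPoints,
            msBeforeNext: res.msBeforeNext,
        });
        throw rejected;
    }
}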

Related Issue(s)

Fixes #5176

How to test

  1. Fork https://gitlab.com/jankeromnes/test-gitpod-prebuilds-rate-limit
  2. Add your fork as a project here: https://jx-rt-lmt-prblds.staging.gitpod-dev.com/new?team=user
  3. Open your fork in Gitpod (any Gitpod) and run ./push-100.sh
  4. Once all 100 commits have been pushed (20 commits to 5 branches), verify that:
    • most prebuilds actually started (but not all 100)
    • some prebuilds never started but got cancelled instead (between 0 and 50, depending on when the burst happened in the 1min rate-limit window)
    • kubectl logs server-[TAB] server | grep -i "rate limit" contains some rate limit warnings (as many as the cancelled prebuilds)

Release Notes

Rate-limit new prebuilds to 50 / user / minute to prevent incidents

Documentation

@jankeromnes added the labels feature: prebuilds, operations: past incident, and team: webapp on Mar 1, 2022
@codecov (bot) commented Mar 1, 2022

Codecov Report

Merging #8504 (35e97c1) into main (6db0f39) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #8504   +/-   ##
=======================================
  Coverage   11.17%   11.17%           
=======================================
  Files          18       18           
  Lines         993      993           
=======================================
  Hits          111      111           
  Misses        880      880           
  Partials        2        2           
Flag Coverage Δ
components-gitpod-cli-app 11.17% <ø> (ø)

Flags with carried forward coverage won't be shown.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 362f860...35e97c1.

@jankeromnes (Contributor, Author) commented Mar 1, 2022

Re-roll 🎲

/werft run

👍 started the job as gitpod-build-jx-rt-lmt-prblds.1

@jankeromnes (Contributor, Author) commented Mar 1, 2022

Expected

  • 100 commits pushed (20 to 5 branches)
  • 50 prebuilds started (state = available)
  • 50 prebuilds cancelled (state = aborted)
  • 50 rate limit warnings

Current state

  • 100 commits pushed (20 to 5 branches)
  • 87 prebuilds started (state = available)
  • 13 prebuilds "queued"? 🤔 (state = queued)
  • 13 rate limit warnings:
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.082Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8499,"consumedPoints":51,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.085Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8495,"consumedPoints":52,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.095Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8485,"consumedPoints":53,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.098Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8483,"consumedPoints":54,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.127Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8453,"consumedPoints":55,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.147Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8434,"consumedPoints":56,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.169Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8411,"consumedPoints":57,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.185Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8395,"consumedPoints":58,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.194Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8386,"consumedPoints":59,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.196Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8385,"consumedPoints":60,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.201Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8379,"consumedPoints":61,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.210Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8371,"consumedPoints":62,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.245Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8335,"consumedPoints":63,"isFirstInDuration":false},{"method":"startPrebuild"}]}

Something seems off here 🤔

EDIT:

  • our 30-second prebuild burst likely happened right at the juncture of two 1-minute "periods", which would explain why not all 50 surplus prebuilds got rate-limited
  • rate-limited prebuilds should get state = 'aborted' instead of remaining in state = 'queued' (see the sketch below)
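
A possible shape for that second fix, purely as a sketch: the interfaces and the storePrebuiltWorkspace name below are hypothetical stand-ins, not the actual WorkspaceDB API.

import { RateLimiterRes } from "rate-limiter-flexible";

// Hypothetical shapes for illustration only; the real PrebuiltWorkspace type differs.
interface PrebuiltWorkspace { id: string; state: "queued" | "building" | "available" | "aborted"; error?: string; }
interface WorkspaceDB { storePrebuiltWorkspace(pws: PrebuiltWorkspace): Promise<PrebuiltWorkspace>; }

// When the limiter rejects a prebuild start, record it as aborted instead of leaving it queued.
async function markPrebuildRateLimited(db: WorkspaceDB, pws: PrebuiltWorkspace, rejected: RateLimiterRes): Promise<void> {
    pws.state = "aborted";
    pws.error = `Prebuild was rate-limited; the window resets in ~${Math.ceil(rejected.msBeforeNext / 1000)}s`;
    await db.storePrebuiltWorkspace(pws);
}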

@gtsiolis (Contributor) commented Mar 1, 2022

See also relevant discussion (internal) in case this is related. Cc @jankeromnes @JanKoehnlein @jldec

@jankeromnes (Contributor, Author) commented:

Ah, cool, I didn't know we could see webhook errors on GitLab. Thanks @gtsiolis!

Would be curious to see if we can get the actual error messages when a webhook call fails.

@gtsiolis (Contributor) commented Mar 1, 2022

> Would be curious to see if we can get the actual error messages when a webhook call fails.

@jankeromnes Yes, we can. Here are the two most recent errors seen in our fork.

  1. https://gitlab.com/gitpod-io/gitlab/-/hooks/10823912/hook_logs/2455346235 (internal)
  2. https://gitlab.com/gitpod-io/gitlab/-/hooks/10823912/hook_logs/2455045008 (internal)

@jankeromnes (Contributor, Author) commented Mar 1, 2022

Hey, thanks a lot for surfacing these. 💯🙏

Both seem to have failed with:

Internal error occurred while delivering this webhook.
Error: Net::ReadTimeout

I wonder if the timeout means Gitpod didn't respond in time, or if it's something inside GitLab that timed out. 🤔

In any case, this feature looks super useful for debugging webhook problems (especially now that we're inventing a new webhook problem: the rate limit 😅)

@jankeromnes (Contributor, Author) commented Mar 2, 2022

Two new errors on the current deployment, and rate-limiting doesn't work 🙄

TypeError: Cannot read properties of undefined (reading 'statusCode')
    at /app/node_modules/@gitbeaker/node/dist/index.js:168:43
    at step (/app/node_modules/@gitbeaker/node/dist/index.js:82:23)
    at Object.next (/app/node_modules/@gitbeaker/node/dist/index.js:63:53)
    at fulfilled (/app/node_modules/@gitbeaker/node/dist/index.js:53:58)
(node:29) UnhandledPromiseRejectionWarning: Error: NotFoundError
    at Object.<anonymous> (/app/node_modules/@gitpod/server/dist/src/errors/index.js:26:41)
    at Generator.next (<anonymous>)
    at /app/node_modules/@gitpod/server/dist/src/errors/index.js:13:71
    at new Promise (<anonymous>)
    at __awaiter (/app/node_modules/@gitpod/server/dist/src/errors/index.js:9:12)
    at Object.create (/app/node_modules/@gitpod/server/dist/src/errors/index.js:21:16)
    at GitlabContextParser.<anonymous> (/app/node_modules/@gitpod/server/dist/src/gitlab/gitlab-context-parser.js:187:52)
    at Generator.next (<anonymous>)
    at fulfilled (/app/node_modules/@gitpod/server/dist/src/gitlab/gitlab-context-parser.js:19:58)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

EDIT: I didn't understand where these errors were coming from, so I just hard-reset my PR to the last state where it still worked (un-rebased and undid my latest changes).

@jankeromnes (Contributor, Author) commented Mar 2, 2022

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-jx-rt-lmt-prblds.3

@@ -358,7 +358,7 @@ class GitpodJsonRpcProxyFactory<T extends object> extends JsonRpcProxyFactory<T>
         throw rlRejected;
     }
     log.warn({ userId }, "Rate limiter prevents accessing method due to too many requests.", rlRejected, { method });
-    throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.round(rlRejected.msBeforeNext / 1000) || 1 });
+    throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.ceil(rlRejected.msBeforeNext / 1000) || 1 });

Member commented:

👍
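
A quick illustration of the rounding difference, using msBeforeNext = 8499 from the first warning payload above (illustrative numbers only):

// retryAfter as computed before and after this change
const msBeforeNext = 8499;
const rounded = Math.round(msBeforeNext / 1000) || 1; // 8: a client retrying after 8s can still hit the closed window
const ceiled = Math.ceil(msBeforeNext / 1000) || 1;   // 9: waiting 9s guarantees the 60-second window has reset
// The trailing `|| 1` keeps retryAfter at a minimum of one second when the division yields 0.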

try {
await UserRateLimiter.instance(this.config.rateLimiter).consume(user.id, startPrebuild);
} catch (error) {
span.setTag("starting", false);

Member commented:

Prefixing with the current span name, e.g. startPrebuild.starting, is nice for discoverability in Honeycomb.
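
In other words, something along these lines (a sketch assuming the opentracing API; the span name and tag are illustrative):

import { globalTracer } from "opentracing";

// Tags prefixed with the operation name, e.g. "startPrebuild.starting", are easier
// to discover in Honeycomb than a bare "starting".
const span = globalTracer().startSpan("startPrebuild");
span.setTag("startPrebuild.starting", false);
span.finish();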

@geropl (Member) commented Mar 2, 2022

@jankeromnes Thx for tackling this long-standing yet very annoying issue. Two thoughts:

  • the current (draft) version throttles per user. With an additional method here we could turn it into per-repo (a per-repo variant is sketched after this list)
  • have you thought about using the DB as ground-truth about how many workspaces are already running per repo? 🤔 This would:
    • effectively share rate-limiting state across server instances: the current approach potentially allows 6 x 50 = 300 prebuilds per minute
    • pave the way to friendlier modes of handling over-traffic than "accept first, drop all others"
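
For the per-repo variant (sketched here under the same rate-limiter-flexible assumption as above), the only real change is which string is used as the limiter key; using cloneURL is my illustration, not necessarily what the linked method does.

import { RateLimiterMemory } from "rate-limiter-flexible";

// Per-repo sketch: budget of 50 prebuild starts per clone URL per minute.
const perRepoLimiter = new RateLimiterMemory({ keyPrefix: "startPrebuild", points: 50, duration: 60 });

async function guardPrebuildStartForRepo(cloneURL: string): Promise<void> {
    // Keying by cloneURL instead of user.id turns the per-user limit into a per-repo one.
    await perRepoLimiter.consume(cloneURL);
}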

@jankeromnes (Contributor, Author) commented Mar 2, 2022

Working again after un-rebase! 🙌

$ kubectl logs server-7577d78b89-5jsmr server | grep -i "rate limit" | wc -l
14
mysql> select state, count(id) prebuilds from d_b_prebuilt_workspace where creationTime > '2022-03-02 14:06' group by state order by prebuilds desc;
+-----------+-----------+
| state     | prebuilds |
+-----------+-----------+
| available |        86 |
| queued    |        14 |
+-----------+-----------+
2 rows in set (0.00 sec)

Now back to marking rate-limited prebuilds as "cancelled" (state = 'aborted') instead of leaving them "pending" (state = 'queued').


> Thx for tackling this long-standing yet very annoying issue.

@geropl You're quite welcome! 😊 Also super happy that we're moving forward with this.

> the current (draft) version throttles per user. With an additional method here we could turn it into per-repo

Aha 💡 thanks for the pointer. I think throttling per user makes sense for now, unless you happen to be the person who installed Gitpod for 3 or more legit super-high-frequency projects.

I'm a bit hesitant to make this PR more complicated / spend more time on it (because of higher priorities), but I agree that eventually we'll most likely want to rate-limit per repo / per project.

> have you thought about using the DB as ground-truth about how many workspaces are already running per repo? 🤔

I agree that this is desirable, but slightly more complicated. See also Kyle's comment in the issue: #5176 (comment) (maybe it's best to continue this discussion there)

> effectively share rate-limiting state across server instances: the current approach potentially allows 6 x 50 = 300 prebuilds per minute

Aha, right, I'm currently testing with a single server instance, which is why I didn't observe this.

Do you know whether repeated requests from the same caller (most likely the GitHub App) tend to reach the same server instance, or whether the selection is effectively random for every request? If it's purely random, we may want to fix the rate-limiter itself.

@jankeromnes (Contributor, Author) commented Mar 2, 2022

> effectively share rate-limiting state across server instances: the current approach potentially allows 6 x 50 = 300 prebuilds per minute

> Aha, right, I'm currently testing with a single server instance, which is why I didn't observe this.

I've tested this PR with 6 servers, using kubectl scale deployments/server --replicas=6.

mysql> select state, count(id) prebuilds from d_b_prebuilt_workspace where creationTime > '2022-03-02 14:42' group by state order by prebuilds desc;
+-----------+-----------+
| state     | prebuilds |
+-----------+-----------+
| available |       100 |
+-----------+-----------+
1 row in set (0.01 sec)
mysql> select state, count(id) prebuilds from d_b_prebuilt_workspace where creationTime > '2022-03-02 14:46' group by state order by prebuilds desc;
+-----------+-----------+
| state     | prebuilds |
+-----------+-----------+
| available |       100 |
+-----------+-----------+
1 row in set (0.01 sec)

Looks like 6 replicas defeat the rate-limiter. 😞

Maybe we need to make the rate-limiter track points across all replicas? What would be the cleanest solution?
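
One commonly used option (not what this PR implements) is to back the limiter with a store shared by all replicas, so every server consumes from the same budget. A sketch assuming rate-limiter-flexible's Redis-backed limiter and ioredis; the REDIS_URL environment variable is an assumption of mine.

import Redis from "ioredis";
import { RateLimiterRedis } from "rate-limiter-flexible";

// Shared-state sketch: with a single Redis budget, 6 replicas still allow only
// 50 prebuild starts per user per minute in total.
const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379"); // assumed shared Redis
const sharedLimiter = new RateLimiterRedis({
    storeClient: redis,
    keyPrefix: "startPrebuild",
    points: 50,
    duration: 60,
});

export async function guardPrebuildStartShared(userId: string): Promise<void> {
    await sharedLimiter.consume(userId); // rejects once the cluster-wide budget is spent
}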

@jankeromnes (Contributor, Author) commented Mar 3, 2022

While this approach (i.e. using the server rate-limiter inside the PrebuildManager) basically works, it has the problem of multiplying the allowed rate limit by the number of server instances in the cluster.

I still believe that we could fix the rate-limiter itself to use the DB in order to share a single state between all server instances of a cluster (thus fixing this approach, and also fixing every other rate-limited server method).

However, for the specific issue of prebuild spam, we've discussed another approach (i.e. not using the rate-limiter, but making the PrebuildManager check the DB for how many prebuilds were previously started on the same cloneURL), which seems even simpler and could immediately prevent this type of outage.
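
Roughly, that alternative could look like the following sketch; countRecentPrebuildsForCloneURL is a hypothetical query helper, not an existing WorkspaceDB method, and the threshold is illustrative.

// Hypothetical DB interface; the real WorkspaceDB query would differ.
interface PrebuildCountingDB {
    countRecentPrebuildsForCloneURL(cloneURL: string, since: Date): Promise<number>;
}

const MAX_PREBUILDS_PER_CLONE_URL_PER_MINUTE = 50; // illustrative threshold

// Using the database as the single source of truth makes the check hold
// no matter how many server replicas handle the incoming webhooks.
async function shouldRateLimitPrebuild(db: PrebuildCountingDB, cloneURL: string): Promise<boolean> {
    const oneMinuteAgo = new Date(Date.now() - 60_000);
    const recent = await db.countRecentPrebuildsForCloneURL(cloneURL, oneMinuteAgo);
    return recent >= MAX_PREBUILDS_PER_CLONE_URL_PER_MINUTE;
}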

Thus, closing this Pull Request for now. @andrew-farries and @easyCZ, if you do find anything useful in this PR, please feel free to recycle it as you see fit. 🙂 Many thanks for taking this over! I'm also happy to help with any questions here.

@jankeromnes jankeromnes closed this Mar 3, 2022
@jankeromnes jankeromnes deleted the jx/rt-lmt-prblds branch March 24, 2022 14:58
Labels: do-not-merge/work-in-progress, feature: prebuilds, operations: past incident, release-note, size/M, team: webapp