-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rate-limit new prebuilds to 50 / user / minute to prevent incidents #8504
Conversation
Codecov Report
@@ Coverage Diff @@
## main #8504 +/- ##
=======================================
Coverage 11.17% 11.17%
=======================================
Files 18 18
Lines 993 993
=======================================
Hits 111 111
Misses 880 880
Partials 2 2
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
Re-roll 🎲 /werft run 👍 started the job as gitpod-build-jx-rt-lmt-prblds.1 |
Expected
Current state
13 rate limit warnings{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.082Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8499,"consumedPoints":51,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.085Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8495,"consumedPoints":52,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.095Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8485,"consumedPoints":53,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.098Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8483,"consumedPoints":54,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.127Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8453,"consumedPoints":55,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.147Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8434,"consumedPoints":56,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.169Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8411,"consumedPoints":57,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.185Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8395,"consumedPoints":58,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.194Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8386,"consumedPoints":59,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.196Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8385,"consumedPoints":60,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.201Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8379,"consumedPoints":61,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.210Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8371,"consumedPoints":62,"isFirstInDuration":false},{"method":"startPrebuild"}]}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-rt-lmt-prblds.1"},"component":"server","severity":"WARNING","time":"2022-03-01T14:39:03.245Z","context":{"userId":"de62eb36-a3ce-4299-8282-72bb58485905"},"message":"Rate limiter prevents accessing method due to too many requests.","payload":[{"remainingPoints":0,"msBeforeNext":8335,"consumedPoints":63,"isFirstInDuration":false},{"method":"startPrebuild"}]} Something seems off here 🤔 EDIT:
|
See also relevant discussion (internal) in case this is related. Cc @jankeromnes @JanKoehnlein @jldec |
Ah, cool, I didn't know we could see webhook errors on GitLab. Thanks @gtsiolis! Would be curious to see if we can get the actual error messages when a webhook call fails. |
@jankeromnes Yes, we can. Here're the two most recent errors seen in our fork. |
Hey, thanks a lot for surfacing these. 💯🙏 Both seem to have failed with:
I wonder if the timeout means Gitpod didn't respond in time, or if it's something inside GitLab that timed out. 🤔 In any case, this feature looks super useful for debugging webhook problems (especially now that we're inventing a new webhook problem: the rate limit 😅) |
35e97c1
to
79ee994
Compare
Two new errors on the current deployment, and rate-limiting doesn't work 🙄
EDIT: Didn't understand where these were coming from, so I just hard-reset my PR to when it still worked (un-rebased + undid my latest changes) |
/werft run with-clean-slate-deployment 👍 started the job as gitpod-build-jx-rt-lmt-prblds.3 |
79ee994
to
35e97c1
Compare
@@ -358,7 +358,7 @@ class GitpodJsonRpcProxyFactory<T extends object> extends JsonRpcProxyFactory<T> | |||
throw rlRejected; | |||
} | |||
log.warn({ userId }, "Rate limiter prevents accessing method due to too many requests.", rlRejected, { method }); | |||
throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.round(rlRejected.msBeforeNext / 1000) || 1 }); | |||
throw new ResponseError<RateLimiterError>(ErrorCodes.TOO_MANY_REQUESTS, "too many requests", { method, retryAfter: Math.ceil(rlRejected.msBeforeNext / 1000) || 1 }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
try { | ||
await UserRateLimiter.instance(this.config.rateLimiter).consume(user.id, startPrebuild); | ||
} catch (error) { | ||
span.setTag("starting", false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefixing with the current span name, e.g. startPrebuild.starting
is nice for discoverability in honeycomb.
@jankeromnes Thx for tackling this long-standing yet very annoying issue. Two thoughts:
|
Working again after un-rebase! 🙌
Now back to marking cancelled Prebuilds as "cancelled" (
@geropl You're quite welcome! 😊 Also super happy that we're moving forward with this.
Aha 💡 thanks for the pointer. I think throttling per user makes sense for now, unless you happen to be the person who installed Gitpod for 3 or more legit super-high-frequency projects. I'm a bit hesitant to make this PR more complicated / spend more time on it (because of higher priorities), but I agree that eventually we'll most likely want to rate-limit per repo / per project.
I agree that this is desirable, but slightly more complicated. See also Kyle's comment in the issue: #5176 (comment) (maybe it's best to continue this discussion there)
Aha, right, I'm currently testing with a single Do you know if there is some stability on which server instance responds to you for repeated requests (you = the GitHub App most likely), or if the selection is purely random for every request? If it's purely random, we may want to fix the rate-limiter itself. |
I've tested this PR with 6 servers, using
Looks like 6 replicas defeat the rate-limiter. 😞 Maybe we need to make the rate-limiter track points across all replicas? What would be the cleanest solution? |
While this approach (i.e. using the server rate-limiter inside the PrebuildManager) basically works, it has the problem of multiplying the allowed rate limit by the number of server instances in the cluster. I still believe that we could fix the rate-limiter itself to use the DB in order to share a single state between all server instances of a cluster (thus fixing this approach, and also fixing every other rate-limited server method). However, for the specific issue of prebuild spam, we've discussed another approach (i.e. not using the rate-limiter, but making the PrebuildManager check the DB for how many prebuilds were previously started on the same cloneURL), which seems even simpler and could immediately prevent this type of outage. Thus, closing this Pull Request for now. @andrew-farries and @easyCZ, if you do find anything useful in this PR, please feel free to recycle it as you see fit. 🙂 Many thanks for taking this over! I'm also happy to help with any questions here. |
Description
Rate-limit new prebuilds to 50 / user / minute to prevent incidents
Related Issue(s)
Fixes #5176
How to test
./push-100.sh
kubectl logs server-[TAB] server | grep -i "rate limit"
contains some rate limit warnings (as many as the cancelled prebuilds)Release Notes
Documentation