
Rate limit prebuilds #5176

Closed
jankeromnes opened this issue Aug 12, 2021 · 14 comments · Fixed by #8568
Assignees
Labels
team: webapp Issue belongs to the WebApp team type: epic

Comments

@jankeromnes
Contributor

Bug description

Today at around 13:24 UTC, there was a sudden jump in the number of prebuilds: over about a minute, the repository https://gitlab.com/yo/gitlab triggered nearly 100 new prebuilds.

Unfortunately, this put excessive pressure on the US cluster, and in particular on the database, which led to several workspaces not being able to start (for example: https://community.gitpod.io/t/django-pod-is-down/4541).

This incident lasted about 20 minutes and was logged here: https://www.gitpodstatus.com/incidents/ddyzghh3mfbm

Steps to reproduce

Instantly trigger 100 prebuilds

Expected behavior

There should be no outage, especially in the database

Example repository

No response

Anything else?

Maybe it would make sense to introduce a reasonable rate limit on prebuilds, for example per repository.

@jankeromnes jankeromnes added feature: prebuilds type: incident Gitpod.io service is unstable labels Aug 12, 2021
@JanKoehnlein
Contributor

/schedule

@csweichel csweichel added the team: webapp Issue belongs to the WebApp team label Aug 24, 2021
@AlexTugarev AlexTugarev self-assigned this Sep 14, 2021
@AlexTugarev
Member

AlexTugarev commented Sep 14, 2021

Looking into the old solution for now: https://github.com/gitpod-io/gitpod/blob/main/components/server/ee/src/prebuilds/prebuild-queue-maintainer.ts

I hope @csweichel or @geropl find time to provide more details on this, please 🙏🏻


After thinking this through, I think the prebuild queue is not the right way to solve this problem. Instead, we should add a proper rate limit on projects, which roughly correspond to repositories.


Talked to Sven, and we decided to try to solve this with extra storage alongside the servers. That should basically be a queue, and servers should pick up a prebuild event when ready to execute it. We assume that using proven technology that already solves queueing (such as Redis) would also allow us to manage prebuilds properly, i.e. enforce rate limits and implement throttling.
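To make the idea concrete, here is a minimal in-memory sketch of such a queue. Redis itself is omitted, and the `PrebuildQueue` name and `PrebuildEvent` shape are illustrative assumptions, not the actual Gitpod implementation; with Redis, the equivalent operations would be pushing to and blocking-popping from a list.

```typescript
// Illustrative stand-in for a queue sitting beside the servers:
// webhook handlers enqueue prebuild events, and a server pulls
// the next event only when it has capacity, instead of starting
// every prebuild immediately on push.
interface PrebuildEvent {
  projectId: string;
  commit: string;
}

class PrebuildQueue {
  private events: PrebuildEvent[] = [];

  // Producer side: enqueue instead of starting a prebuild directly.
  enqueue(event: PrebuildEvent): void {
    this.events.push(event);
  }

  // Consumer side: a server picks up the oldest pending event.
  next(): PrebuildEvent | undefined {
    return this.events.shift();
  }

  get length(): number {
    return this.events.length;
  }
}
```

Because servers pull at their own pace, a burst of 100 pushes fills the queue rather than starting 100 prebuilds at once.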

@JanKoehnlein
Contributor

I don't think this is actively worked on, and I also think this has to be specified and broken into smaller issues. Removing it from groundwork.

@stale

stale bot commented Jan 26, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Jan 26, 2022
@geropl
Member

geropl commented Feb 3, 2022

This just led to an incident: https://app.incident.io/incidents/101
Thus added it to #7812

/cc @jldec

@stale stale bot removed the meta: stale This issue/PR is stale and will be closed soon label Feb 3, 2022
@jankeromnes jankeromnes added the priority: highest (user impact) Directly user impacting label Feb 3, 2022
@jldec jldec moved this to Scheduled in 🍎 WebApp Team Feb 3, 2022
@jldec
Contributor

jldec commented Feb 3, 2022

Scheduled this issue.
Similar to #7400

@jldec
Contributor

jldec commented Feb 24, 2022

Removing highest priority label - not 🔥.

@jldec jldec added the priority: highest (user impact) Directly user impacting label Feb 27, 2022
@jldec
Contributor

jldec commented Feb 27, 2022

Back on the urgent list after another close call (internal)
cc: @kylos101

@jankeromnes
Contributor Author

jankeromnes commented Feb 28, 2022

Happy to look into implementing a reasonable rate limit that doesn't break normal(+) usage but prevents incidents.

@jankeromnes jankeromnes self-assigned this Feb 28, 2022
@jldec jldec changed the title Sudden jump in prebuilds can lead to brief outage Rate limit prebuilds Mar 1, 2022
@jankeromnes
Contributor Author

jankeromnes commented Mar 1, 2022

I did some research (internal) on our top prebuild rates for February 2022:

Research reveals three top rate classes:

  • "Abuse-like" (50-85+ / min)
  • "Dependabot" (25-50 / min)
  • "Popular & mirrors" (up to 25 / min)

I think we'll want to stop the "Abuse-like", but continue supporting the "Popular & mirrors".

Also, the "Dependabot" case is interesting:

  • Developers probably don't need all Dependabot pushes to be prebuilt all the time. Maybe it's okay to fail some prebuilds in case of high-frequency push bursts?
  • Also, auto-enabling Incremental Prebuilds for top 10 "Dependabot" repositories would likely help us significantly

All in all, I believe an initial per-user rate limit of 50 prebuilds per minute makes sense (i.e. allow up to 50 prebuilds per minute, then reject any new prebuild requests beyond that until the next minute):

  • only affects "abuse-like" repositories for now
  • might already be sufficient to prevent outages (e.g. the initial 20min outage on 12 Aug 2021 was caused by a 100 prebuilds / minute rate)
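The limit described above could be implemented as a simple fixed-window counter per user. The following is a sketch under assumed names (`PrebuildRateLimiter`, the injectable clock), not the actual Gitpod code:

```typescript
// Fixed-window rate limiter: allow up to `limit` prebuild starts
// per user within each one-minute window; reject the rest until
// the window rolls over.
class PrebuildRateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();

  constructor(
    private readonly limit = 50,
    private readonly windowMs = 60_000,
    private readonly now: () => number = Date.now, // injectable for testing
  ) {}

  tryStart(userId: string): boolean {
    const t = this.now();
    const w = this.windows.get(userId);
    if (!w || t - w.windowStart >= this.windowMs) {
      // First request, or the previous window expired: start a new window.
      this.windows.set(userId, { windowStart: t, count: 1 });
      return true;
    }
    if (w.count >= this.limit) {
      return false; // over the limit: reject this prebuild request
    }
    w.count++;
    return true;
  }
}
```

A fixed window is the simplest variant; a sliding window would smooth out bursts that straddle a window boundary, at the cost of more bookkeeping.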

@kylos101
Contributor

kylos101 commented Mar 1, 2022

Hi @jankeromnes , would it be prohibitive to also include a check to see how many total prebuilds this user is running now, and not allow them to do more than 50?

@jankeromnes
Contributor Author

jankeromnes commented Mar 1, 2022

Hi @kylos101! I think this is an interesting, but somewhat more complicated proposal. For one, it's a bit harder to research how many in-progress prebuilds all users have at any given time, in order to get the full picture of what normal vs. abusive usage looks like (but I think it can be done). Also, long-building, high-frequency projects like GitLab might well fit under a 25/min rate, but have prebuilds running for 45 min, thus potentially ramping up to 1000+ parallel prebuilds at certain times. Suddenly failing 950+ of these might have dramatic consequences (and, if we do fail these 950+ excess prebuilds, I think we need to be smart about picking which ones get cancelled).

All in all, I think that:

  • Short term, the proposed rate-limit on the prebuild trigger is already a very simple, low-risk fix that can prevent outages like we've seen in the past without disrupting normal UX
  • Longer term, we'll need to get better at being a CI, i.e. more clearly define how many jobs users are allowed to run in parallel; how things get queued up or dropped if there are too many requests; how always-increasing queues are handled. Also, we'll likely want to do things like cancelling running Prebuilds for commits that are no longer HEAD of a branch or PR (e.g. new push, force-push, branch deleted, etc could cancel the older Prebuilds); rolling out Incremental Prebuilds to more projects (some quick wins there); and maybe consider skipping new Prebuilds for "minor changes" (if we can detect these at all)

Just my 2 cents though, I may be wrong. 😊 Please feel free to open a feature request about limiting concurrent prebuilds per user, and I'd be happy to help with research and/or implementation.

@kylos101
Contributor

kylos101 commented Mar 1, 2022

Hey @jankeromnes 👋 ,

I think this is an interesting, but somewhat more complicated proposal

I thought that might be, but wanted to ask for the sake of posterity (my lens was the one user, two repos, and ~100 PRs). Thank you very much for your thoughtful response!

Short term, the proposed rate-limit on the prebuild trigger is already a very simple, low-risk fix that can prevent outages like we've seen in the past without disrupting normal UX

Totally agreed, this will 💯 help! 🙏 🌟

Longer term

Those are excellent points! 🥇 I appreciate you sharing the related use cases / different ways to control the throttle.

@jldec is there a related epic for "get better at being a CI"? I ask because @jankeromnes made some great points (above), and I'm not sure if they exist elsewhere in a related epic. Also, it might make sense to cancel prebuilds if a PR is merged or closed without merging.

@geropl
Member

geropl commented Mar 3, 2022

We just discussed a possible approach that uses the DB to limit the rate at which prebuilds start per repo.
@andrew-farries and @easyCZ are looking into this together, and I can help out anytime.
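One way to sketch the DB-backed idea: before starting a prebuild, count how many prebuilds the repository started within the last minute and refuse new ones above the threshold. The SQL shown in the comment and the function name are assumptions about the approach discussed here, not the actual implementation (which landed in #8568):

```typescript
// Before starting a prebuild, a server could run a query along the lines of:
//
//   SELECT COUNT(*) FROM prebuilds
//   WHERE projectId = ? AND creationTime > NOW() - INTERVAL 1 MINUTE;
//
// (table and column names are hypothetical). The admission decision is
// then a pure function of that count:
function mayStartPrebuild(recentPrebuildCount: number, limitPerMinute = 50): boolean {
  return recentPrebuildCount < limitPerMinute;
}
```

Keeping the counter in the DB means every server replica sees the same state, so the limit holds across the cluster without extra coordination infrastructure.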

@geropl geropl moved this from Scheduled to In Progress in 🍎 WebApp Team Mar 4, 2022
Repository owner moved this from In Progress to Done in 🍎 WebApp Team Mar 8, 2022
@jldec jldec added type: epic and removed feature: prebuilds type: incident Gitpod.io service is unstable priority: highest (user impact) Directly user impacting labels Apr 8, 2022