
Rate limit prebuilds #5176

Closed
jankeromnes opened this issue Aug 12, 2021 · 14 comments · Fixed by #8568
Assignees
Labels
team: webapp Issue belongs to the WebApp team type: epic

Comments

@jankeromnes
Contributor

Bug description

Today at around 13:24 UTC, there was a sudden jump in the number of prebuilds: over about a minute, the repository https://gitlab.com/yo/gitlab triggered nearly 100 new prebuilds.

Unfortunately, this put excessive pressure on the US cluster, and in particular on the database, which led to several workspaces not being able to start (for example: https://community.gitpod.io/t/django-pod-is-down/4541).

This incident lasted about 20 minutes and was logged here: https://www.gitpodstatus.com/incidents/ddyzghh3mfbm

Steps to reproduce

Instantly trigger 100 prebuilds

Expected behavior

There should be no outage, especially in the database

Example repository

No response

Anything else?

Maybe it would make sense to introduce a reasonable rate limit on prebuilds, for example per repository.

@jankeromnes jankeromnes added feature: prebuilds type: incident Gitpod.io service is unstable labels Aug 12, 2021
@JanKoehnlein
Contributor

/schedule

@csweichel csweichel added the team: webapp Issue belongs to the WebApp team label Aug 24, 2021
@AlexTugarev AlexTugarev self-assigned this Sep 14, 2021
@AlexTugarev
Member

AlexTugarev commented Sep 14, 2021

Looking into the old solution for now: https://github.com/gitpod-io/gitpod/blob/main/components/server/ee/src/prebuilds/prebuild-queue-maintainer.ts

I hope @csweichel or @geropl find time to provide more details on this, please 🙏🏻


After thinking this through, I think the prebuild queue is not the right way to solve this problem. Instead, we should add a proper rate limit on projects, which roughly correspond to repositories.


Talked to Sven, and we decided to try to solve this with extra storage alongside the servers. That should basically be a queue, and servers should pick up a prebuild event when ready to execute it. We assume that using proven technology that already solves queueing (such as Redis) would also allow us to manage prebuilds properly, i.e. enforce rate limits and implement throttling.
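To make the idea concrete, here is a minimal in-memory sketch of such a queue. Redis itself is omitted, and the `PrebuildQueue` name and `PrebuildEvent` shape are illustrative assumptions, not the actual Gitpod implementation; with Redis, the equivalent operations would be pushing to and blocking-popping from a list.

```typescript
// Illustrative stand-in for a queue sitting beside the servers:
// webhook handlers enqueue prebuild events, and a server pulls
// the next event only when it has capacity, instead of starting
// every prebuild immediately on push.
interface PrebuildEvent {
  projectId: string;
  commit: string;
}

class PrebuildQueue {
  private events: PrebuildEvent[] = [];

  // Producer side: enqueue instead of starting a prebuild directly.
  enqueue(event: PrebuildEvent): void {
    this.events.push(event);
  }

  // Consumer side: a server picks up the oldest pending event.
  next(): PrebuildEvent | undefined {
    return this.events.shift();
  }

  get length(): number {
    return this.events.length;
  }
}
```

Because servers pull at their own pace, a burst of 100 pushes fills the queue rather than starting 100 prebuilds at once.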

@JanKoehnlein
Contributor

I don't think this is actively worked on, and I also think this has to be specified and broken into smaller issues. Removing it from groundwork.

@stale

stale bot commented Jan 26, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Jan 26, 2022
@geropl
Member

geropl commented Feb 3, 2022

This just led to an incident: https://app.incident.io/incidents/101
Thus added it to #7812

/cc @jldec

@stale stale bot removed the meta: stale This issue/PR is stale and will be closed soon label Feb 3, 2022
@jankeromnes jankeromnes added the priority: highest (user impact) Directly user impacting label Feb 3, 2022
@jldec jldec moved this to Scheduled in 🍎 WebApp Team Feb 3, 2022
@jldec
Contributor

jldec commented Feb 3, 2022

Scheduled this issue.
Similar to #7400

@jldec
Contributor

jldec commented Feb 24, 2022

Removing highest priority label - not 🔥.

@jldec jldec added the priority: highest (user impact) Directly user impacting label Feb 27, 2022
@jldec
Contributor

jldec commented Feb 27, 2022

Back on the urgent list after another close call (internal)
cc: @kylos101

@jankeromnes
Contributor Author

jankeromnes commented Feb 28, 2022

Happy to look into implementing a reasonable rate limit that doesn't break normal(+) usage but prevents incidents.

@jankeromnes jankeromnes self-assigned this Feb 28, 2022
@jldec jldec changed the title Sudden jump in prebuilds can lead to brief outage Rate limit prebuilds Mar 1, 2022
@jankeromnes
Contributor Author

jankeromnes commented Mar 1, 2022

I did some research (internal) on our top prebuild rates for February 2022:

Research reveals three top rate classes:

  • "Abuse-like" (50-85+ / min)
  • "Dependabot" (25-50 / min)
  • "Popular & mirrors" (up to 25 / min)

I think we'll want to stop the "Abuse-like", but continue supporting the "Popular & mirrors".

Also, the "Dependabot" case is interesting:

  • Developers probably don't need all Dependabot pushes to be prebuilt all the time. Maybe it's okay to fail some prebuilds in case of high-frequency push bursts?
  • Also, auto-enabling Incremental Prebuilds for top 10 "Dependabot" repositories would likely help us significantly

All in all, I believe an initial per-user rate limit of 50 prebuilds per minute makes sense (i.e. allow up to 50 prebuilds per minute, then reject any new prebuild requests beyond that until the next minute):

  • only affects "abuse-like" repositories for now
  • might already be sufficient to prevent outages (e.g. the initial 20min outage on 12 Aug 2021 was caused by a 100 prebuilds / minute rate)
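The limit described above could be implemented as a simple fixed-window counter per user. The following is a sketch under assumed names (`PrebuildRateLimiter`, the injectable clock), not the actual Gitpod code:

```typescript
// Fixed-window rate limiter: allow up to `limit` prebuild starts
// per user within each one-minute window; reject the rest until
// the window rolls over.
class PrebuildRateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();

  constructor(
    private readonly limit = 50,
    private readonly windowMs = 60_000,
    private readonly now: () => number = Date.now, // injectable for testing
  ) {}

  tryStart(userId: string): boolean {
    const t = this.now();
    const w = this.windows.get(userId);
    if (!w || t - w.windowStart >= this.windowMs) {
      // First request, or the previous window expired: start a new window.
      this.windows.set(userId, { windowStart: t, count: 1 });
      return true;
    }
    if (w.count >= this.limit) {
      return false; // over the limit: reject this prebuild request
    }
    w.count++;
    return true;
  }
}
```

A fixed window is the simplest variant; a sliding window would smooth out bursts that straddle a window boundary, at the cost of more bookkeeping.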

@kylos101
Contributor

kylos101 commented Mar 1, 2022

Hi @jankeromnes , would it be prohibitive to also include a check to see how many total prebuilds this user is running now, and not allow them to do more than 50?

@jankeromnes
Contributor Author

jankeromnes commented Mar 1, 2022

Hi @kylos101! I think this is an interesting, but somewhat more complicated proposal. For one, it's a bit harder to research how many in-progress prebuilds all users have at any given time, in order to get the full picture of what normal vs. abusive usage looks like (but I think it can be done). Also, long-building, high-frequency projects like GitLab might well fit under a 25/min rate, but have prebuilds running for 45 min, thus potentially ramping up to 1000+ parallel prebuilds at certain times. Suddenly failing 950+ of these might have dramatic consequences (and, if we do fail these 950+ excess prebuilds, I think we need to be smart about picking which ones get cancelled).

All in all, I think that:

  • Short term, the proposed rate-limit on the prebuild trigger is already a very simple, low-risk fix that can prevent outages like we've seen in the past without disrupting normal UX
  • Longer term, we'll need to get better at being a CI, i.e. more clearly define how many jobs users are allowed to run in parallel; how things get queued up or dropped if there are too many requests; how always-increasing queues are handled. Also, we'll likely want to do things like cancelling running Prebuilds for commits that are no longer HEAD of a branch or PR (e.g. new push, force-push, branch deleted, etc could cancel the older Prebuilds); rolling out Incremental Prebuilds to more projects (some quick wins there); and maybe consider skipping new Prebuilds for "minor changes" (if we can detect these at all)

Just my 2 cents though, I may be wrong. 😊 Please feel free to open a feature request about limiting concurrent prebuilds per user, and I'd be happy to help with research and/or implementation.

@kylos101
Contributor

kylos101 commented Mar 1, 2022

Hey @jankeromnes 👋 ,

I think this is an interesting, but somewhat more complicated proposal

I thought that might be, but wanted to ask for the sake of posterity (my lens was the one user, two repos, and ~100 PRs). Thank you very much for your thoughtful response!

Short term, the proposed rate-limit on the prebuild trigger is already a very simple, low-risk fix that can prevent outages like we've seen in the past without disrupting normal UX

Totally agreed, this will 💯 help! 🙏 🌟

Longer term

Those are excellent points! 🥇 I appreciate you sharing the related use cases / different ways to control the throttle.

@jldec is there a related epic for "get better at being a CI"? I ask because @jankeromnes made some great points (above), and I'm not sure if they exist elsewhere in a related epic. Also, it might make sense to cancel prebuilds if a PR is merged or closed without merging.

@geropl
Member

geropl commented Mar 3, 2022

We just discussed a possible approach that uses the DB to limit the rate at which prebuilds start per repo.
@andrew-farries and @easyCZ are looking into this together, and I can help out anytime.
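One way to sketch the DB-backed idea: before starting a prebuild, count how many prebuilds the repository started within the last minute and refuse new ones above the threshold. The SQL shown in the comment and the function name are assumptions about the approach discussed here, not the actual implementation (which landed in #8568):

```typescript
// Before starting a prebuild, a server could run a query along the lines of:
//
//   SELECT COUNT(*) FROM prebuilds
//   WHERE projectId = ? AND creationTime > NOW() - INTERVAL 1 MINUTE;
//
// (table and column names are hypothetical). The admission decision is
// then a pure function of that count:
function mayStartPrebuild(recentPrebuildCount: number, limitPerMinute = 50): boolean {
  return recentPrebuildCount < limitPerMinute;
}
```

Keeping the counter in the DB means every server replica sees the same state, so the limit holds across the cluster without extra coordination infrastructure.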

@geropl geropl moved this from Scheduled to In Progress in 🍎 WebApp Team Mar 4, 2022
Repository owner moved this from In Progress to Done in 🍎 WebApp Team Mar 8, 2022
@jldec jldec added type: epic and removed feature: prebuilds type: incident Gitpod.io service is unstable priority: highest (user impact) Directly user impacting labels Apr 8, 2022