-
Notifications
You must be signed in to change notification settings - Fork 160
chore(worker,web): Repo indexing stability improvements + perf improvements to web #563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
CLEANUP | ||
} | ||
|
||
model RepoJob { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to name this to RepoIndexJob
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: still need to perform the migration
connections RepoToConnection[] | ||
imageUrl String? | ||
/// @deprecated status tracking is now done via the `jobs` table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder if we can just remove this field all together? Will need to think about the migration story.
@brendan-kellam your pull request is missing a changelog! |
...well this PR escalated quickly 👀
TL;DR: solved some repo indexing concurrency issues by moving from BullMQ to GroupMQ, refactored how we represent indexing jobs in the db, and generally deleted a ton of things that got in the way of web perf.
Where it started
We were getting bug reports (#523, #436) that repo indexing would fail at the clone or fetch stage with strange git errors. After some investigation on a test deployment of ~1k repos, we noticed that the root cause seemed to be that multiple indexing jobs could run on the same repository at the same time. When this happens, two workers will attempt to perform git operations on the same folder, resulting in the weird git behaviour we were seeing. For example:
repoid=1
repoid=1
/repos/1
/repos/1
<-- now two git processes are operating on the same repo. bad news -->
Q: Why were two workers being scheduled for the same repository in the first place?
A: I'm still not certain, but here's my theory: The job scheduler uses the the database in order to schedule new indexing jobs. This happens in two operations: 1) fetch all repositories that need indexing, and 2) create BullMQ jobs for these repositories and flag them as indexing. We were not performing 1 and 2 in a transaction, so this was not an atomic operation. I noticed that failures would typically happen in our k8s cluster when a new replica was being spun up. Likely, we were getting interleavings between replica A and B something like
A:1, B:1, A:2, B:2
, resulting in duplicate jobs for the same set of repos.Put another way, we were using the Postgres database as a distributed lock to enforce the "1 repo per worker" invariant. This felt error prone - what we really want is the invariant to be enforced at the queue level within Redis.
BullMQ has the concept of groups:
This sounds like the perfect solution: each repository can be in it's own group such that we are guaranteed that two jobs operating on the same repository must be processed one-by-one. Unfortunately, this feature is only available in the commercial version of BullMQ, and taking a dependency on a commercially licensed upstream dependency felt like a non-starter.
Enter groupmq - it's a brand new library (literally on v1.0.0) built by OpenPanel.dev. It supports grouping, has a similar api as BullMQ, and is MIT licensed. Was slightly hesitant to use such a new library, but their docs look good, so I decided to move to that.
repoIndexManager.ts
contains the new & improved indexer. In addition to moving to GroupMQ, I've also moved to representing job status in a separateRepoJob
table in s.t., we can maintain a 1:1 mapping between a job in Redis and how we represent it in the database. This PR deprecates therepoIndexingStatus
field.Where it went
After updating the indexer and moving to using the
RepoJob
table, I had to refactor how the progress indicators, repo carousel, and repo table work. This led into a bunch of chore work in web:/connections
viewTODO:
Fixes #523
Fixes #436
Fixes #462
Fixes #530
Fixes #452