Conversation

brendan-kellam
Contributor

@brendan-kellam brendan-kellam commented Oct 15, 2025

(screenshot)

...well this PR escalated quickly 👀

TL;DR: solved some repo indexing concurrency issues by moving from BullMQ to GroupMQ, refactored how we represent indexing jobs in the db, and generally deleted a ton of things that got in the way of web perf.

Where it started

We were getting bug reports (#523, #436) that repo indexing would fail at the clone or fetch stage with strange git errors. After some investigation on a test deployment of ~1k repos, we noticed that the root cause seemed to be that multiple indexing jobs could run on the same repository at the same time. When this happens, two workers will attempt to perform git operations on the same folder, resulting in the weird git behaviour we were seeing. For example:

  1. Worker 1 is scheduled on repoId=1
  2. Worker 2 is scheduled on repoId=1
  3. Worker 1 starts and spawns a git process to clone into /repos/1
  4. Worker 1 is preempted by the Node.js runtime.
  5. Worker 2 starts and spawns a git process to clone into /repos/1
    <-- now two git processes are operating on the same repo. bad news -->
  6. Worker 2 receives an error from its git process that the clone failed. Exits with the job marked as failed.
  7. Worker 1 resumes.
  8. Worker 1 receives an error from its git process that the clone failed. Exits with the job marked as failed.
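The failure mode in steps 5–8 can be modelled without git at all: git guards its refs with lock files created exclusively, so when two processes operate on the same directory, the second one to take a lock fails. A minimal sketch (paths and worker names are hypothetical; this simulates git's lock-file behaviour, it does not invoke git):

```typescript
// Model of why two workers cloning into the same directory fail:
// an exclusive lock file (like git's HEAD.lock) can only be created once.
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

const repoDir = fs.mkdtempSync(path.join(os.tmpdir(), "repo-1-"));
const lockPath = path.join(repoDir, "HEAD.lock");

function takeLock(worker: string): boolean {
  try {
    // "wx" = create exclusively; fails with EEXIST if the file already exists
    const fd = fs.openSync(lockPath, "wx");
    fs.closeSync(fd);
    console.log(`${worker}: acquired lock`);
    return true;
  } catch {
    console.log(`${worker}: lock already held, aborting`);
    return false;
  }
}

const results = [takeLock("worker-1"), takeLock("worker-2")];
```

The first worker wins the lock and the second aborts, which surfaces to the job runner as a "strange" git failure.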

Q: Why were two workers being scheduled for the same repository in the first place?
A: I'm still not certain, but here's my theory: the job scheduler uses the database to schedule new indexing jobs. This happens in two operations: (1) fetch all repositories that need indexing, and (2) create BullMQ jobs for those repositories and flag them as indexing. We were not performing (1) and (2) in a transaction, so the operation was not atomic. I noticed that failures would typically happen in our k8s cluster when a new replica was being spun up. Likely, we were getting interleavings between replicas A and B like A:1, B:1, A:2, B:2, resulting in duplicate jobs for the same set of repos.
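A toy model of that interleaving (all names are illustrative, not the actual Sourcebot code): two scheduler replicas each read the "needs indexing" set, yield before flagging, and both enqueue a job for the same repo.

```typescript
// Two replicas do (1) fetch repos needing indexing, then (2) enqueue + flag,
// without a transaction. Yielding between the steps lets both replicas see
// the same pending set and enqueue duplicate jobs.
type Repo = { id: number; indexing: boolean };

const db: Repo[] = [{ id: 1, indexing: false }];
const queue: { replica: string; repoId: number }[] = [];

const tick = () => new Promise<void>((r) => setImmediate(r));

async function schedule(replica: string) {
  // step 1: fetch repos that need indexing
  const pending = db.filter((r) => !r.indexing);
  await tick(); // replica yields here, before step 2 (the race window)
  // step 2: enqueue jobs and flag the repos as indexing
  for (const repo of pending) {
    queue.push({ replica, repoId: repo.id });
    repo.indexing = true;
  }
}

const done = Promise.all([schedule("A"), schedule("B")]);
done.then(() => console.log(`queued ${queue.length} jobs for repo 1`));
```

Both replicas observe `indexing: false` in step 1, so the queue ends up with two jobs for repo 1. Wrapping both steps in a single transaction (or moving the invariant into the queue itself, as below) closes the window.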

Put another way, we were using the Postgres database as a distributed lock to enforce the "1 repo per worker" invariant. This felt error-prone; what we really want is for the invariant to be enforced at the queue level, within Redis.

BullMQ has the concept of groups:

> Groups allows you to use a single queue while distributing the jobs among groups so that the jobs are processed one by one relative to the group they belong to.

This sounds like the perfect solution: each repository gets its own group, guaranteeing that two jobs operating on the same repository are processed one at a time. Unfortunately, this feature is only available in the commercial version of BullMQ, and taking a dependency on a commercially licensed upstream felt like a non-starter.

Enter groupmq: a brand new library (literally on v1.0.0) built by OpenPanel.dev. It supports grouping, has a similar API to BullMQ, and is MIT licensed. I was slightly hesitant to adopt such a new library, but the docs look good, so I decided to move to it.
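The guarantee the PR relies on can be modelled in a few lines: jobs sharing a groupId (here, a repo id) run strictly one at a time in FIFO order, while different groups may interleave. This is an in-memory sketch of the semantics, not the actual groupmq API:

```typescript
// Per-group serialization: chain each group's jobs on a promise so that a
// job only starts after the previous job in the same group has settled.
class GroupQueue<T> {
  private chains = new Map<string, Promise<void>>();

  add(groupId: string, job: () => Promise<T>): Promise<T> {
    const prev = this.chains.get(groupId) ?? Promise.resolve();
    const run = prev.then(job);
    // keep the chain alive even if a job rejects
    this.chains.set(groupId, run.then(() => {}, () => {}));
    return run;
  }
}

const q = new GroupQueue<void>();
const order: string[] = [];
const work = (label: string) => async () => {
  order.push(`${label}:start`);
  await new Promise((r) => setTimeout(r, Math.random() * 10));
  order.push(`${label}:end`);
};

const all = Promise.all([
  q.add("repo-1", work("r1-a")), // same group: r1-a must finish...
  q.add("repo-1", work("r1-b")), // ...before r1-b starts
  q.add("repo-2", work("r2-a")), // different group: free to interleave
]);
all.then(() => console.log(order.join(" ")));
```

With a queue like this keyed by repository id, two indexing jobs can never touch the same `/repos/<id>` directory concurrently, regardless of how many scheduler replicas enqueue them.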

repoIndexManager.ts contains the new & improved indexer. In addition to moving to GroupMQ, I've also moved job status into a separate RepoJob table such that we can maintain a 1:1 mapping between a job in Redis and its representation in the database. This PR deprecates the repoIndexingStatus field.
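For a sense of the shape of that 1:1 mapping, a hypothetical Prisma model (field names illustrative; see the actual RepoJob model in the diff for the real definition):

```prisma
model RepoJob {
  id        String   @id            // mirrors the job id in Redis
  repoId    Int
  repo      Repo     @relation(fields: [repoId], references: [id])
  status    String                  // e.g. pending | running | completed | failed
  createdAt DateTime @default(now())
}
```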

Where it went

After updating the indexer and moving to the RepoJob table, I had to refactor how the progress indicators, repo carousel, and repo table work. This led to a bunch of chore work in web:

  • Removed long polling / client-side fetching of repositories, in favour of fetching data in server components and having refresh buttons.
  • Updated the progress indicator's look. It now shows progress the first time repositories are indexed.
(screenshot)
  • Removed the warning and error nav indicators.
  • Optimized the Ask mode search scope selector with virtualization.
  • Added separate search and ask navigation bar items.
(screenshot)
  • Removed the /connections view.

TODO:

  • Test migration path
  • More Prometheus metrics (will do this in a separate PR)

Fixes #523
Fixes #436
Fixes #462
Fixes #530
Fixes #452


coderabbitai bot commented Oct 15, 2025

Review skipped: auto reviews are disabled on this repository.


@blacksmith-sh blacksmith-sh bot deleted a comment from brendan-kellam Oct 16, 2025
@brendan-kellam brendan-kellam changed the title [wip] chore(worker): Repo indexing stability improvements chore(worker,web): Repo indexing stability improvements + perf improvements to web Oct 17, 2025
```prisma
CLEANUP
}

model RepoJob {
```
Contributor Author


Might want to rename this to RepoIndexJob

Contributor Author


todo: still need to perform the migration

```prisma
connections RepoToConnection[]
imageUrl String?
/// @deprecated status tracking is now done via the `jobs` table.
```
Contributor Author


Wonder if we can just remove this field altogether? Will need to think about the migration story.

@brendan-kellam brendan-kellam marked this pull request as ready for review October 18, 2025 01:22

@brendan-kellam your pull request is missing a changelog!
