The sync task often takes a long time because it syncs all repos for all orgs upon a) login or b) pressing the sync button. This also increases our rate limit usage. This issue is an attempt to improve performance overall.
Some context from a previous conversation:
"
The current REST approach loops over the N repos that were synced in the sync_repos task and performs the sync_language task, calling the provider N times - these are repos we're guaranteed to have in our DB. The main difference with the GQL approach is that it uses an owner to fetch M repos from the provider, where N and M could differ (N being the repos synced in our DB, M being the repos in the provider). Because of that, we'd also have to make M DB calls to ensure each repo we see in the provider is also in our DB, and only then update languages. We'd be trading some rate limit requests for some DB ones, imo not terrible, but the task/code would be a bit different. It would also be a special case for gh, separate from gl/bb. Not opposed to the refactoring, just that it's not a "replace rest w/ gql" change.
Other interesting things, mostly related to the sync_repos task + consequently to the sync_languages task:
TLDR, the sync_languages task currently gets the affected repoids from the sync_repos task. The sync_repos task has 2 variants: one for owners that have integrations (A), one for owners that don't (B). While they both sync repos, variant B syncs a list of all repos a user has access to - as an example, I have access to the repos from these owners (this is local data): {('github', 'adrian-codecov'), ('github', 'aviquez96'), ('github', 'codecov'), ('github', 'temp-org-test-123'), ('github', 'terry-codecov'), ('github', 'Turing-Corp')}. If you dig further into list_repos(), you'll see it behaves differently depending on whether you supply a username, e.g. "codecov", which only looks for repos for that org, vs. when you don't, which fetches all the repos from all the owners you have access to. Now, I've played around a bit locally, and when you specify a username it's substantially faster; I believe our current implementation does what it does because back in the day we had that "all repos" page, but we've since gotten rid of it and always specify an org, and the sync was never updated to reflect that. I think this is a HUGE opportunity for performance improvement around the sync task in general, and it might be that we just missed it along the way.
To be super explicit, when I click resync here, I'm resyncing for all of these even though I have Codecov selected. That imo is arguably undesired behavior.
Now, maybe syncing "everything" is still desired functionality, and we can play around with that if we want to, but I think this is an opportunity to optimize calls here. I'm not sure if enterprise customers have only 1 org, in which case this wouldn't help too much, but if they do, that would be $$
Lastly! Referring to variant A of our sync_repos task, which syncs based on whether you're using an integration, I have reason to believe we're not leveraging that functionality at all. I can expand more on this too, but basically all of the places that call the sync task have using_integration = False afaik
"
Worker
- pass a specific org to the git_list fns (see the sketch below)
API
- accept a param specifying which org to sync (see the sketch below)
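A minimal sketch of what these Worker and API changes could look like, assuming a torngit-style list_repos(username=None) and a Celery-style task; sync_repos_for_org, org_username, and the endpoint shape are illustrative, not the real code.

```python
from typing import Protocol


class GitProvider(Protocol):
    async def list_repos(self, username: str | None = None) -> list[dict]: ...


async def sync_repos_for_org(
    provider: GitProvider, org_username: str | None = None
) -> list[dict]:
    """Worker side: if the caller says which org it cares about, list only that
    org's repos instead of every repo across every owner the user can access."""
    return await provider.list_repos(username=org_username)


# API side (illustrative): the sync endpoint accepts an optional org param and
# forwards it when kicking off the worker task, e.g.
#
#   POST /sync  {"org": "codecov"}
#   -> sync_repos_task.apply_async(kwargs={"org_username": "codecov"})
#
# so hitting "Resync" while the Codecov org is selected no longer re-syncs
# every org the user has access to.
```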