
Investigate gracefully failing if we encounter a 502 from GitHub during sync_repos #1757

Closed
drazisil-codecov opened this issue May 16, 2024 · 8 comments · Fixed by codecov/shared#362
Assignees
Labels
bug Something isn't working investigation

Comments

@drazisil-codecov

It appears that in some cases GitHub returns a 502 during the pagination calls to fetch repos (not on the first call, and not a rate limit).

The current behavior appears to be to cancel the entire DB transaction. It would be better if we could log the error and commit what we have so far, giving the customer at least a partial list of repos instead of an empty one.
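A minimal sketch of the proposed behavior, committing each page as it arrives so a mid-pagination 502 preserves what was already fetched. The names (`fetch_repo_pages`, `save_repos`, `UpstreamServerError`) are illustrative assumptions, not the actual worker code:

```python
import logging

log = logging.getLogger(__name__)


class UpstreamServerError(Exception):
    """Stands in for a 5xx response from the GitHub API."""


def sync_repos(db_session, fetch_repo_pages, save_repos):
    """Commit per page instead of once at the end of the whole sync."""
    synced = 0
    try:
        for page in fetch_repo_pages():
            save_repos(db_session, page)
            db_session.commit()  # commit this page; a later 502 won't undo it
            synced += len(page)
    except UpstreamServerError as exc:
        # Log and keep what we have rather than rolling back everything.
        log.warning(
            "GitHub returned a server error mid-sync; keeping %d repos",
            synced,
            exc_info=exc,
        )
    return synced
```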

@drazisil-codecov drazisil-codecov added the bug Something isn't working label May 16, 2024
@drazisil-codecov
Author

#1549 feels related

@heather-codecov

One customer found that logging out of GitHub completely and logging back in may have helped. (They also clarified that it was not an issue of authorizing their account to the org; that had already been authorized.)

@eliatcodecov

Changes in order of importance:

  1. Change the commit logic so we don't throw out the entire repo list when we hit a 502.
  2. Test a smaller page size on the GitHub API call and measure the impact on sync times. If acceptable, consider rolling it out.
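Item 2 can be sketched as below. GitHub's list-repos endpoints take a real `per_page` query parameter and return pagination via the `Link` header; the `get` callable is injected here (with the shape of `requests.get`) purely so the sketch is testable, and is not the worker's actual HTTP client:

```python
def list_repos(get, per_page=50):
    """Fetch all repos, following Link-header pagination.

    `get` is a callable with the shape of `requests.get`; `per_page`
    is the experimental smaller page size discussed above.
    """
    url = "https://api.github.com/user/repos"
    params = {"per_page": per_page}
    repos = []
    while url:
        resp = get(url, params=params)
        params = None  # subsequent `next` URLs already carry per_page
        resp.raise_for_status()
        repos.extend(resp.json())
        # requests exposes the parsed Link header as resp.links
        url = resp.links.get("next", {}).get("url")
    return repos
```

A smaller `per_page` means more requests but a cheaper, faster response per request, which is the hypothesis behind the experiment.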

@matt-codecov

the 502s are a problem in their own right, which will hopefully be addressed by codecov/shared#232 and codecov/shared#233

as for gracefully failing in sync_repos: we already commit after each repo insert (https://github.com/codecov/worker/blob/1beef847cdd951cd084adf4632893c39f4c442e9/tasks/sync_repos.py#L436), so we don't throw repo data away on error. what we can improve is how we update owner.permission, the list of private repos that a user can see; codecov/worker#478 does that

previously, SoftTimeLimitExceeded and TorngitClientError would cause that list to be totally erased, while TorngitServerFailureError (5xx errors) was uncaught, so the owner.permission list would be left alone. now, for all three errors, we log the error, note that the list of private repos in owner.permission may be incomplete, and then let the task update it and finish normally.
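a hedged sketch of that unified handling, under the assumption that the task collects repo ids and then writes owner.permission afterward. the exception class names mirror the ones in this thread, but are stubbed here; `sync_private_repo_ids` and `fetch_private_repo_ids` are illustrative stand-ins, not the real task:

```python
import logging

log = logging.getLogger(__name__)


class SoftTimeLimitExceeded(Exception):
    pass


class TorngitClientError(Exception):
    pass


class TorngitServerFailureError(Exception):
    """5xx errors from the upstream provider."""


def sync_private_repo_ids(fetch_private_repo_ids):
    """Collect as many private repo ids as possible before any failure."""
    repo_ids, incomplete = [], False
    try:
        for repo_id in fetch_private_repo_ids():
            repo_ids.append(repo_id)
    except (SoftTimeLimitExceeded, TorngitClientError, TorngitServerFailureError) as exc:
        incomplete = True
        log.warning(
            "sync interrupted; owner.permission may be incomplete",
            exc_info=exc,
        )
    # the caller still updates owner.permission with whatever we gathered
    return repo_ids, incomplete
```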

for SoftTimeLimitExceeded and TorngitClientError, this is better behavior. you can at least see some repos, rather than none. for TorngitServerFailureError, it's kind of a mixed bag. in the old behavior, new repos would not appear but old ones would not disappear (even if they should). in the new behavior, new repos may appear, but old ones may disappear (even if they shouldn't). i think the new behavior is still better on balance

@matt-codecov

status update: two of the three PRs shipped, and it seems the problem was solved for the customer who reported tons of 5XXs blocking their usage of the service

the list_repos page size experiment was just ramped to 100% enrollment, so 50% control and 50% test. no regressions observed so far, so we will probably roll it out. no benefits observed either, but 5XXs aside from the one customer are very rare, so we can't effectively measure benefits anyway

leaving this issue open until the experiment is cleaned up

@matt-codecov

aiaiai. finished rolling out the new page size of 50, and then discovered a user for whom 50 is not low enough but 20 is. so, going to set the experiment back up with 50 as the new control and 20 as the new test group. stay tuned...

@thomasrockhu-codecov
Contributor

@matt-codecov I think this can be closed now due to your investigation? Or is this still an active bug?

@matt-codecov

i guess so. github shipped an optimization to the "list repos" endpoint that appeared to solve the problem, and i haven't heard from @vlad-ko that the main affected customer is still having issues. i will get a PR up to clean up the experiment and close this
