Tracking issue for rework of rust-repos scraper #121

Open
3 tasks
NULLx76 opened this issue Oct 15, 2024 · 2 comments

Comments

NULLx76 commented Oct 15, 2024

Last year, I used this repository as part of my research analysing the release practices of all Java repositories on GitHub. While doing so, I discovered that it has a few issues, partly because it has not been updated in a while. I hope it is not too presumptuous of me to suggest a rework, but I think it would be worthwhile and I am willing to take it on myself.

This is a tracking issue, documenting all the things I've found (and still remember).
When I encounter/remember more, I'll add them to this issue.

  • Rate limits not being respected (#122)
  • Retrying is not done correctly (sort of related to rate limits)
  • Outdated dependencies and Rust edition
    • This also includes using libraries such as failure, which is deprecated
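
To make the first two points concrete, here is a minimal sketch of how the wait time between requests could be derived. It is not the code from my Java scraper or from rust-repos; `RateLimitInfo`, `wait_for_rate_limit` and `backoff_delay` are hypothetical names. It assumes the values have already been parsed from the `retry-after`, `x-ratelimit-remaining` and `x-ratelimit-reset` response headers that GitHub's REST API sends.

```rust
use std::time::Duration;

/// Rate-limit details read from one response (hypothetical helper type).
/// Every field is optional because not every response carries every header.
#[derive(Debug, Clone, Copy, Default)]
pub struct RateLimitInfo {
    /// Value of `retry-after`, in seconds.
    pub retry_after_secs: Option<u64>,
    /// Value of `x-ratelimit-remaining`.
    pub remaining: Option<u64>,
    /// Value of `x-ratelimit-reset`, in UTC epoch seconds.
    pub reset_epoch_secs: Option<u64>,
}

/// How long to wait before the next request, or `None` if no wait is needed.
pub fn wait_for_rate_limit(info: RateLimitInfo, now_epoch_secs: u64) -> Option<Duration> {
    // An explicit `retry-after` takes priority over everything else.
    if let Some(secs) = info.retry_after_secs {
        return Some(Duration::from_secs(secs));
    }
    // Quota exhausted: wait until the reset time, plus one second of slack.
    if info.remaining == Some(0) {
        if let Some(reset) = info.reset_epoch_secs {
            let secs = reset.saturating_sub(now_epoch_secs) + 1;
            return Some(Duration::from_secs(secs));
        }
    }
    None
}

/// Exponential backoff for failures that carry no rate-limit hint:
/// `base * 2^attempt`, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}
```

The idea is that a retry loop first asks `wait_for_rate_limit` whether the server gave an explicit hint, and only falls back to `backoff_delay` when it did not, so that retries never hammer the API while the quota is exhausted.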

The final scraper I implemented for Java can be found here, specifically in src/scraper. I would mostly like to port that code to rust-repos, as I have verified that it works and it should be largely applicable.

A natural problem I ran into when scraping millions of repositories is that scraping all of GitHub can take weeks when respecting the rate limits, even with some tricks.
There are different solutions to this, but first it would be good to find out how much of a problem this is for Rust, as there are far fewer Rust repositories than Java ones. This is also related to #65: in its current state that may simply not be feasible, but I can look into whether it is.
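
As a rough way to gauge this, the lower bound on scraping time follows directly from the request count and the hourly quota. The sketch below uses a hypothetical helper, `min_scrape_time`; the request count in the usage note is purely illustrative and not a measurement, and the quota assumes the 5,000 requests per hour that GitHub documents for authenticated REST API use.

```rust
use std::time::Duration;

/// Lower bound on the wall-clock time needed to issue `total_requests`
/// at a sustained rate of `requests_per_hour` (hypothetical helper).
pub fn min_scrape_time(total_requests: u64, requests_per_hour: u64) -> Duration {
    assert!(requests_per_hour > 0, "rate limit must be positive");
    // Work in u128 to avoid overflow, and round up to whole seconds.
    let numerator = total_requests as u128 * 3600;
    let rate = requests_per_hour as u128;
    let secs = (numerator + rate - 1) / rate;
    Duration::from_secs(secs as u64)
}
```

For example, `min_scrape_time(1_000_000, 5_000)` is 720,000 seconds, i.e. 200 hours or a bit over 8 days, and ten times as many requests would take roughly 83 days. How many requests the Rust ecosystem actually needs is the open question.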

NULLx76 commented Oct 15, 2024

@rustbot claim

rustbot commented Oct 15, 2024

Error: This repository is not enabled to use triagebot.
Add a triagebot.toml in the root of the default branch to enable it.

Please file an issue on GitHub at triagebot if there's a problem with this bot, or reach out on #t-infra on Zulip.

NULLx76 mentioned this issue Oct 20, 2024