
Crate fetching is slow #45

Open
Owez opened this issue Feb 24, 2021 · 9 comments

@Owez
Contributor

Owez commented Feb 24, 2021

I'm doing a test run of the crates command, and as I write this it is still fetching crates. It seems to take quite a long time, even for a crate with a relatively low dependency count such as cargo-supply-chain (only 79 deps listed), as seen here:

[screenshot: crate fetching in progress]

It would be more useful if it fetched these in parallel, especially given the high latency of the crates.io API.

@HeroicKatora
Collaborator

This is by design. There is no high latency; the tool is just being cordial to crates.io, as they suggest a 2-second delay between individual requests and there is no endpoint for performing multiple queries in a single request. I suppose we could clarify this in the output some more, but 'This will take roughly 2 seconds per crate due to API rate limits' seemed clear enough.

@Shnatsel Would you vote against a flag that allows overriding this limit? I'm cautious and probably wouldn't add it, but it might be 'fine' if we also enforced that the user sets a custom user agent in these situations. (Similar to how nmap allows overriding the default rate to ridiculously high values if you actually want to barrage your local network, but you need two flags and have to know what you're doing.) And we should ask the crates.io team whether doing it once for testing would be of concern to them, or whether they could offer a bulk endpoint.
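
For context, the current behavior is conceptually just a sequential loop with a fixed delay. A minimal sketch, assuming the `ureq` crate and the public `https://crates.io/api/v1/crates/{name}` endpoint (illustrative only, not the tool's actual code):

```rust
use std::thread::sleep;
use std::time::Duration;

// Sketch of a crawler-policy-compliant fetch loop: one request per crate
// with a fixed delay in between. The `ureq` crate and the exact endpoint
// are assumptions for illustration, not necessarily what the tool uses.
fn fetch_crate_metadata(names: &[String]) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    let mut bodies = Vec::with_capacity(names.len());
    for name in names {
        let url = format!("https://crates.io/api/v1/crates/{}", name);
        let body = ureq::get(&url)
            .set("User-Agent", "cargo-supply-chain example")
            .call()?
            .into_string()?;
        bodies.push(body);
        // Be cordial to crates.io: roughly one request every 2 seconds.
        sleep(Duration::from_secs(2));
    }
    Ok(bodies)
}
```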

@Shnatsel
Member

We would need to clear a higher request rate with the crates.io team. Right now the documentation explicitly mentions one request per second for crawlers, but says nothing about other uses. I believe docs.rs makes requests at a much higher rate. I asked in the crates.io Discord once, but received no reply.

Right now you can speed up this process by running cargo supply-chain update. After that, all operations are performed immediately, but the data may be out of date by up to 48 hours.

@Shnatsel
Member

@Owez could you post the output of time cargo supply-chain crates so we can see whether it's abnormally high? 80 * 2 = 160 seconds is the expected time to fetch the live data.

@Owez
Contributor Author

Owez commented Feb 24, 2021

0.22s user 0.04s system 0% cpu 2:37.31 total

My time seems normal, and it does feel like a ~2-second delay per crate. Without a bulk API the delay unfortunately seems borderline unusable, especially for a crate with a larger dependency count. Could automatic per-crate caching, deleted if older than 24 hours, be used for now as a balanced workaround, rather than using the whole crates.io index?
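
A minimal sketch of what such a cache could look like, assuming one file per crate under a cache directory and a 24-hour eviction window (the layout, naming, and `fetch` callback are hypothetical):

```rust
use std::fs;
use std::path::Path;
use std::time::Duration;

// Hypothetical per-crate cache with 24-hour eviction, roughly as suggested
// above. The directory layout, file naming, and eviction window are all
// illustrative, not part of the actual tool.
fn cached_or_fetch(
    cache_dir: &Path,
    crate_name: &str,
    fetch: impl Fn(&str) -> std::io::Result<String>,
) -> std::io::Result<String> {
    let path = cache_dir.join(format!("{}.json", crate_name));
    if let Ok(meta) = fs::metadata(&path) {
        let age = meta.modified()?.elapsed().unwrap_or(Duration::MAX);
        if age < Duration::from_secs(24 * 60 * 60) {
            return fs::read_to_string(&path); // still fresh, skip the network
        }
        let _ = fs::remove_file(&path); // older than 24h: evict and re-fetch
    }
    let body = fetch(crate_name)?;
    fs::create_dir_all(cache_dir)?;
    fs::write(&path, &body)?;
    Ok(body)
}
```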

@Shnatsel
Member

Such granular caching would add complexity, and I am not eager to add it. However, I fear we will have to go down that road sooner or later, if only because people will likely want to view both crates and publishers for the same crate, and it would make no sense to re-download the data.

I see two ways we could solve this without going for a crate-granular caching system:

  1. We could try to contact the crates.io team again and ask what the rate limits for this tool should be, since they only have rate limits specified for crawlers (by which we abide).
  2. The file we need is only ~10 MB, but it is stuffed into a ~250 MB monolithic archive, and we have to download the entire 250 MB. Splitting up the archive would reduce the amount of traffic required by the update command by 25x-100x.

@Shnatsel changed the title from "Async/parallel crate fetching" to "Crate fetching is slow" on Feb 24, 2021
@HeroicKatora
Collaborator

We could estimate the time for fetching them one-by-one and the time for downloading the complete dump, and adaptively choose whether to refresh the cache. I'm slightly concerned about how this would affect scripts that want to rely on one of the two behaviors.
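
A rough sketch of that heuristic, with caller-supplied estimates for the dump size and the available bandwidth (the signature and constants are illustrative, not a proposed API):

```rust
use std::time::Duration;

// Refresh the cached dump when the estimated cost of live per-crate fetching
// exceeds the estimated cost of re-downloading the dump. The dump size and
// bandwidth are caller-supplied guesses.
fn should_refresh_dump(num_crates: usize, dump_bytes: u64, bytes_per_sec: u64) -> bool {
    let per_request_delay = Duration::from_secs(2); // crates.io crawler policy
    let live_fetch = per_request_delay * num_crates as u32;
    let dump_download = Duration::from_secs(dump_bytes / bytes_per_sec.max(1));
    dump_download < live_fetch
}
```

For example, with 79 crates at ~2 s each (~158 s) versus a 250 MB dump at 5 MB/s (~50 s), refreshing the dump would win.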

@tarcieri
Member

tarcieri commented Feb 24, 2021

The file we need is only ~10 MB, but it is stuffed into a ~250 MB monolithic archive, and we have to download the entire 250 MB. Splitting up the archive would reduce the amount of traffic required by the update command by 25x-100x.

Could an ETL job work here? If so, you could probably use GitHub Actions for this in a completely automated manner, by periodically running a job that commits to a git repo like this one:

https://github.com/RustSec/advisory-db/blob/master/.github/workflows/publish-web.yml

That could either be fetched via git (where the periodically running job commits deltas from the previous state) or served over HTTP via GitHub Pages.

@HeroicKatora
Collaborator

HeroicKatora commented Feb 24, 2021

I don't think we want to keep old files around, as that is just duplicate content. However, it does become more attractive after thinking about it for a while. In particular, we could use a GitHub repo, or just a git branch whose version history is regularly squashed. This would mitigate the growth and give GitHub a chance to remove old files (otherwise we'd already be at a few gigabytes after a year). It would also allow us to use freely available CDNs for GitHub files.

@tarcieri
Member

GitHub Pages will give you a CDN for free.

I was suggesting that if you regenerated a text file (or, ideally, multiple text files), you wouldn't need to leave "old files" around, as the repo would be updated with only the parts that changed.

If you don't want/care about that, you could just force-push every time, with GitHub Pages (or even just https://raw.githubusercontent.com/) serving the content over HTTP(S).
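
On the consumer side, fetching such a pre-split, regularly regenerated file would be a single small HTTP request. A minimal sketch, with a placeholder URL and the `ureq` crate used purely for illustration:

```rust
use std::io::Read;

// Fetch a small, regularly regenerated data file over HTTPS instead of the
// full dump. The URL is a placeholder for wherever the file would be served
// (GitHub Pages or raw.githubusercontent.com), not a real endpoint.
fn fetch_published_snapshot() -> Result<String, Box<dyn std::error::Error>> {
    let url = "https://example.github.io/supply-chain-data/crates.csv"; // hypothetical
    let mut body = String::new();
    ureq::get(url)
        .set("User-Agent", "cargo-supply-chain example")
        .call()?
        .into_reader()
        .read_to_string(&mut body)?;
    Ok(body)
}
```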
