Data dumps feedback #2078
Thanks a lot for the feedback! In the first implementation of the database dumps, we planned to support one way of using the data – importing it back into an empty Postgres database. One of the use cases we want to support is local testing with realistic data during development of crates.io itself. We intentionally chose the CSV export format instead of a Postgres-specific format to allow other uses of the data as well, but we did not put any effort into facilitating those other uses. At least so far – we are happy to make improvements based on your feedback. :)
Could you please explain in a bit more detail how you want to use the data? The best way to support lib.rs would be some kind of event stream interface, I guess?
For lib.rs I'm interested only in data that is not available in the crates.io index. I started by extracting download numbers. Currently I use only aggregated download numbers per crate per day, and it's OK if the data is up to a week stale.

I'm planning to use crate ownership data next. The data dump saves a lot of requests to the website API, and it has GitHub IDs (the website API doesn't expose the IDs). However, I'll probably still fall back to the website API for newly published crates, because I want to display correct ownership on crate pages as soon as I create them. Changes in ownership would be useful as some kind of event stream, since they can happen independently of crate releases.

I'm also interested in knowing who published each version of a crate (not the Cargo.toml authors, but the GitHub account/token owner), so that I can judge which users are most active.
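As a rough sketch of that per-crate, per-day aggregation: the file layout and column names below are assumptions based on the crates.io schema (`versions.csv` with `id` and `crate_id`; `version_downloads.csv` with `version_id`, `date`, `downloads`), and any columns not listed are simply ignored.

```rust
// Sketch: aggregate per-crate, per-day download counts from the dump CSVs.
use std::collections::HashMap;
use std::error::Error;

use serde::Deserialize;

#[derive(Deserialize)]
struct VersionRow {
    id: u64,
    crate_id: u64,
}

#[derive(Deserialize)]
struct DownloadRow {
    version_id: u64,
    date: String, // e.g. "2019-11-04"
    downloads: u64,
}

fn main() -> Result<(), Box<dyn Error>> {
    // First pass: map each version id to its crate id.
    let mut crate_of: HashMap<u64, u64> = HashMap::new();
    let mut versions = csv::Reader::from_path("data/versions.csv")?;
    for row in versions.deserialize() {
        let v: VersionRow = row?;
        crate_of.insert(v.id, v.crate_id);
    }

    // Second pass: sum downloads per (crate id, day) across all versions.
    let mut daily: HashMap<(u64, String), u64> = HashMap::new();
    let mut downloads = csv::Reader::from_path("data/version_downloads.csv")?;
    for row in downloads.deserialize() {
        let d: DownloadRow = row?;
        if let Some(&crate_id) = crate_of.get(&d.version_id) {
            *daily.entry((crate_id, d.date)).or_insert(0) += d.downloads;
        }
    }
    println!("{} (crate, day) buckets", daily.len());
    Ok(())
}
```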
Thanks for the details! I currently don't have much time to work on this, but I plan to do so in the next few months. I hope to get feedback from other people as well – I should probably ask on URLO.

We are currently working on an auditing table for crate owner changes. Maybe we can add some kind of event stream interface for selected database tables. This could cover ownership changes and download counts, which would address two of your points.

The information available via the API and in the database dumps is almost identical, but the database dumps publish a few additional database columns. The information about who published a crate version is available in the crates.io database, but it is currently not made public. I'm open to publishing this data, since it could be valuable for auditing dependencies as well.
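As a strawman for what a record in such an event stream could look like – every name below is invented for illustration, and nothing like this currently exists in crates.io – a consumer of a newline-delimited JSON feed of ownership changes might deserialize records like this:

```rust
// Purely hypothetical: one possible record shape for an ownership-change
// event stream. This is not an actual crates.io format.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
enum OwnerAction {
    Added,
    Removed,
}

#[derive(Debug, Deserialize)]
struct OwnerChangeEvent {
    crate_name: String,
    github_id: u64, // useful to lib.rs, per the comment above
    action: OwnerAction,
    at: String, // RFC 3339 timestamp
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A consumer could tail the feed and keep only events newer than its
    // last checkpoint.
    let line = r#"{"crate_name":"rand","github_id":12345,"action":"added","at":"2019-11-04T12:00:00Z"}"#;
    let event: OwnerChangeEvent = serde_json::from_str(line)?;
    println!("{:?}", event);
    Ok(())
}
```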
I'm confused, `versions.published_by` is currently public?
@carols10cents You are right – it's exposed in both the database dumps and the API (e.g. https://crates.io/api/v1/crates/rand/0.7.1 returns a `published_by` field).
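As a quick illustration, a sketch of reading that field with `reqwest` and `serde_json`; everything about the response other than the `version.published_by` path is deliberately left unmodelled:

```rust
// Minimal sketch of reading `published_by` from the endpoint above.
// Requires reqwest with the "blocking" and "json" features enabled.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // crates.io asks API clients to identify themselves via User-Agent.
    let client = reqwest::blocking::Client::builder()
        .user_agent("db-dump-feedback-example (you@example.com)")
        .build()?;
    let body: serde_json::Value = client
        .get("https://crates.io/api/v1/crates/rand/0.7.1")
        .send()?
        .json()?;
    // `published_by` is null when the publisher is unknown.
    println!("{}", body["version"]["published_by"]);
    Ok(())
}
```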
I'm interested in using the dumps in https://github.com/rust-secure-code/cargo-supply-chain, and we're also facing issues 1 and 2 from the original post. The data we need is just the list of owners for a given crate, 3.5 MB all told. However, we have to download the entire 250 MB archive to get to that info, and download and parse a very large file with READMEs in it to get the mapping from crate names to crate IDs.
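For reference, the best workaround available with the current format is to stream the tarball and keep only the wanted entries, which still pays the full download and decompression cost this comment describes. A minimal sketch using the `tar` and `flate2` crates, assuming the dump lives at https://static.crates.io/db-dump.tar.gz and contains an entry ending in data/crate_owners.csv:

```rust
// Sketch: stream the dump tarball and keep only the owners table.
// Everything up to that entry still has to be downloaded and gunzipped,
// which is exactly the cost described in the comment above.
use std::io::Read;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let response = reqwest::blocking::get("https://static.crates.io/db-dump.tar.gz")?;
    let gz = flate2::read::GzDecoder::new(response);
    let mut archive = tar::Archive::new(gz);

    for entry in archive.entries()? {
        let mut entry = entry?;
        if entry.path()?.ends_with("data/crate_owners.csv") {
            let mut csv_bytes = Vec::new();
            entry.read_to_end(&mut csv_bytes)?;
            println!("got {} bytes of owner data", csv_bytes.len());
            break; // everything after this entry is never decompressed
        }
    }
    Ok(())
}
```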
…, r=pietroalbini

Drop crates.textsearchable_index_col from the export

This column is Postgres-specific and is normally populated via a trigger. The trigger is now enabled during the import so that the column can be dropped from the export. This addresses part of what was raised in bullet point 2 of #2078. The large readme column remains because there could be people using that data, but the text search column is redundant.

r? `@smarnach` cc `@kornelski`
…, r=pietroalbini

Include only the last 90 days of downloads in our database dumps

In #3479 we plan to drop old entries and archive them in some other way, so old entries will eventually disappear from dumps anyway. This should make use of the database dumps much more practical for daily use. I think it would be reasonable to even limit this to the past week of data.

r? `@pietroalbini` cc `@kornelski`, #2078
Overall, having full data access is great! However, the implementation could be optimized:

- The tarball format is problematic. It doesn't allow random access, so to extract only the interesting parts it's necessary to download and decompress most of it. A ZIP archive or individually gzipped files would be more convenient for selective consumption.
- `crates.csv` is needed to map crate names to crates.io-internal IDs, so parsing this file is required to make sense of the rest of the data. However, this file also contains the bodies of README files, which are relatively big. It'd be nice to put READMEs separately. It also has `textsearchable_index_col`, which is Postgres-specific and redundant.
- `version_downloads.csv` is the largest file, and it will keep growing. It'd be nice to shard this data by time (e.g. a separate file per year or even per day). I would like to get downloads data daily, but I'd rather download one day of data daily, not all days every day; a hypothetical consumer for such a layout is sketched below.
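Concretely, under an invented naming scheme of one gzipped CSV per day (not a real crates.io endpoint), a daily updater could look like this:

```rust
// Hypothetical consumer for the per-day sharding proposed above. The URL
// scheme is invented for illustration; the point is how cheap a daily
// update would become compared to re-downloading the full history.
use std::io::Read;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let day = "2019-11-03"; // in practice: yesterday's date
    let url = format!("https://static.crates.io/version_downloads/{}.csv.gz", day);
    let response = reqwest::blocking::get(&url)?;

    let mut csv = String::new();
    flate2::read::GzDecoder::new(response).read_to_string(&mut csv)?;
    println!("one day of downloads: {} bytes, instead of the full history", csv.len());
    Ok(())
}
```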