
Data dumps feedback #2078

Closed
kornelski opened this issue Dec 30, 2019 · 8 comments · Fixed by #8748
Labels
C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Comments

@kornelski
Contributor

Overall, having full data access is great! However, the implementation could be optimized:

  1. The tarball format is problematic. It doesn't allow random access, so to extract only the interesting parts it's necessary to download and decompress most of it. A ZIP archive or individually gzipped files would be more convenient for selective consumption.

  2. crates.csv is needed to map crate names to crates.io internal IDs, so parsing this file is required to make sense of the rest of the data. However, this file also contains the bodies of README files, which are relatively big. It'd be nice to put READMEs in a separate file. It also has textsearchable_index_col, which is Postgres-specific and redundant.

  3. version_downloads.csv is the largest file, and it will keep growing. It'd be nice to shard this data by time (e.g. a separate file per year, or even per day). I would like to fetch download data daily, but I'd rather download one day of data each day, not all days every day.
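To illustrate point 1: tar stores members back to back with no index, so finding one file means scanning the archive from the start, while ZIP keeps a central directory at the end that records each member's offset, allowing a single member to be read directly. A minimal in-memory sketch (the file names and contents here are made up, not the real dump layout):

```python
import io
import tarfile
import zipfile

files = {"crates.csv": b"id,name\n1,serde\n", "readme.csv": b"big readme blob\n"}

# Build a tarball: members are concatenated with no central index, so
# locating one member requires walking the archive from the beginning.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Build a ZIP: the central directory records each member's offset, so a
# single member can be read without decompressing the others.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, mode="w") as zf:
    for name, data in files.items():
        zf.writestr(name, data)

with zipfile.ZipFile(io.BytesIO(zip_buf.getvalue())) as zf:
    crates = zf.read("crates.csv")  # direct lookup via the central directory

print(crates.decode())
```

With an HTTP range request against a remote ZIP, the same lookup needs only the central directory plus the one member, which is the selective-consumption property the tarball lacks.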

@smarnach
Contributor

smarnach commented Jan 2, 2020

Thanks a lot for the feedback!

In the first implementation of the database dumps, we planned to support one way of using the data – by importing it back into an empty Postgres database. One of the use cases we want to support is local testing with realistic data during development of crates.io itself. We intentionally chose the CSV export format instead of a format specific to Postgres to allow other uses of the data as well, but we did not put any effort into facilitating other ways of using the data. At least so far – we are happy to make improvements based on your feedback. :)

  1. It should be relatively straightforward to also provide ZIP archives, or we could switch from tarballs to ZIP altogether. So far we have advertised the dumps as "experimental", so we are free to make incompatible changes.

    Making individually gzipped files available adds considerable complexity, since we would need a way to advertise the list of files that comprise the full database dump (which may change over time). Currently we simply have a single fixed location in our S3 bucket that gets overwritten every day.

  2. The files are database dumps. Each file corresponds to a table in the database, and given the other use cases we want to support, I don't think we will fundamentally depart from this approach. However, we may mitigate the two issues you mention anyway. First, the readme column in the crates table is likely going to go away – readme files are stored per version rather than per crate. And we can probably simply disable exporting textsearchable_index_col and tell Postgres to regenerate it after import if desired.

  3. I completely agree that it is kind of ridiculous that the version_downloads table accounts for two thirds of the whole size of the dump, while it stays largely the same every day. Right now I don't have a good idea how to address this without fundamentally changing the way the exports work – maybe we need a second, completely different export format aimed at different use cases?
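One possible shape for such a second export format, matching the sharding-by-time idea from the original post: write one gzipped CSV per day, so a daily consumer fetches only the newest shard. A sketch assuming the dump's version_downloads.csv columns; the sample rows and output file names are hypothetical:

```python
import csv
import gzip
import io
from collections import defaultdict

# Sample rows standing in for version_downloads.csv.
rows = [
    {"version_id": "1", "downloads": "10", "date": "2020-01-01"},
    {"version_id": "2", "downloads": "5", "date": "2020-01-01"},
    {"version_id": "1", "downloads": "7", "date": "2020-01-02"},
]

# Group rows by date, then emit one gzipped CSV per day.
shards = defaultdict(list)
for row in rows:
    shards[row["date"]].append(row)

outputs = {}
for day, day_rows in shards.items():
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["version_id", "downloads", "date"])
    writer.writeheader()
    writer.writerows(day_rows)
    outputs[f"version_downloads-{day}.csv.gz"] = gzip.compress(buf.getvalue().encode())

print(sorted(outputs))
```

Historical shards never change, so they cache well and a consumer only ever downloads days it hasn't seen yet.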

Could you please explain a bit more in what way you want to use the data? The best way to support lib.rs would be some kind of event stream interface, I guess?

@carols10cents carols10cents added the C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works label Jan 2, 2020
@kornelski
Contributor Author

kornelski commented Jan 2, 2020

For lib.rs I'm interested only in data that is not available in .crate archives/Cargo.toml. For example, I don't need the dependencies, authors, categories, and keywords tables, because I already have that data from the primary source. I almost don't need README files, except for crates that use a ../README.md path for the README, because in that case the README file is missing from the .crate archive.

I started by extracting download numbers. Currently I use only aggregated download numbers per crate per day, and it's OK if the data is up to a week stale.
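The aggregation described here boils down to joining version_downloads to versions so that version-level counts roll up to crate-level counts per day. A sketch assuming the dump's CSV column names; the sample rows are made up:

```python
import csv
import io
from collections import defaultdict

# Sample rows standing in for versions.csv and version_downloads.csv.
versions_csv = "id,crate_id\n10,1\n11,1\n20,2\n"
downloads_csv = (
    "version_id,downloads,date\n"
    "10,3,2020-01-01\n"
    "11,4,2020-01-01\n"
    "20,9,2020-01-01\n"
)

# Map each version to its crate, then sum downloads per (crate, day).
version_to_crate = {
    row["id"]: row["crate_id"] for row in csv.DictReader(io.StringIO(versions_csv))
}
per_crate_day = defaultdict(int)
for row in csv.DictReader(io.StringIO(downloads_csv)):
    key = (version_to_crate[row["version_id"]], row["date"])
    per_crate_day[key] += int(row["downloads"])

print(dict(per_crate_day))
```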

I'm planning to use crate ownership data next. The data dump saves a lot of requests to the website API, and it has GitHub IDs (which the website API doesn't expose). However, I'll probably still fall back to the website API for newly published crates, because I want to display correct ownership on crate pages as soon as I create them. Changes in ownership would be useful as some kind of event stream, since they can happen independently of crate releases.

I'm also interested in knowing who published each version of a crate (not the Cargo.toml authors, but the GitHub account/token owner), so that I can judge which users are most active.

@smarnach
Contributor

smarnach commented Jan 2, 2020

Thanks for the details! I currently don't have much time to work on this, but I plan to do so in the next few months. I hope to get feedback from other people as well – I should probably ask on URLO.

We are currently working on an auditing table for crate owner changes. Maybe we can add some kind of event stream interface for selected database tables. This could cover ownership changes and download counts, which would address two of your points.

The information available via the API and in the database dumps is almost identical, but the database dumps publish a few additional database columns.

The information about who published a crate version is available in the crates.io database, but currently it's not made public. I'm open to publishing this data, since it could be valuable for auditing dependencies as well.

@carols10cents
Member

> The information about who published a crate version is available in the crates.io database, but currently it's not made public. I'm open to publishing this data, since it could be valuable for auditing dependencies as well.

I'm confused, versions.published_by is currently public?

@smarnach
Contributor

smarnach commented Jan 5, 2020

@carols10cents You are right – it's exposed in both the database dumps and the API (e.g. https://crates.io/api/v1/crates/rand/0.7.1 returns a published_by key). I think I got confused by the unused versions_published_by table.

@Shnatsel
Member

I'm interested in using the dumps in https://github.com/rust-secure-code/cargo-supply-chain and we're also facing issues 1 and 2 from the original post.

The data we need is just the list of owners for each crate, 3.5 MB all told. However, we have to download the entire 250 MB archive to get to that info, and download and parse a very large file with READMEs in it to get the mapping from crate names to crate IDs.
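Until the format changes, a consumer can at least avoid buffering the whole archive by reading the tarball as a stream and keeping only the members it needs, discarding the rest as it goes. A sketch using Python's `tarfile` stream mode; the member paths are assumptions, since the real dump nests its files under a dated directory:

```python
import io
import tarfile

WANTED = {"crates.csv", "crate_owners.csv"}

def extract_wanted(stream):
    """Stream a .tar.gz and keep only members whose basename we need."""
    found = {}
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:  # '|' = stream mode
        for member in tar:
            name = member.name.rsplit("/", 1)[-1]
            if name in WANTED:
                found[name] = tar.extractfile(member).read()
    return found

# Build a small in-memory dump to demonstrate; in practice the stream
# would be the HTTP response body, consumed without touching disk.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for path, data in [
        ("2020-01-01/data/crates.csv", b"id,name\n1,serde\n"),
        ("2020-01-01/data/readmes.csv", b"huge readme blob\n"),
        ("2020-01-01/data/crate_owners.csv", b"crate_id,owner_id\n1,42\n"),
    ]:
        info = tarfile.TarInfo(name=path)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

result = extract_wanted(io.BytesIO(buf.getvalue()))
print(sorted(result))
```

This still downloads and decompresses every byte (the problem from point 1), but memory use stays bounded and nothing unwanted is written out.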

bors added a commit that referenced this issue May 14, 2021

Drop crates.textsearchable_index_col from the export

This column is Postgres-specific and is normally populated via a
trigger. The trigger is now enabled during the import so that the column
can be dropped from the export.

This addresses part of what was raised in bullet point 2 of #2078. The large readme column remains because there could be people using that data, but the text search column is redundant.

r? `@smarnach`
cc `@kornelski`
bors added a commit that referenced this issue May 14, 2021

Include only the last 90 days of downloads in our database dumps

In #3479 we plan to drop old entries and archive them in some other way, so old entries will eventually disappear from the dumps anyway. This should make use of the database dumps much more practical for daily use. I think it would be reasonable to limit this even further, to the past week of data.

r? `@pietroalbini`
cc `@kornelski`, #2078
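The 90-day window in this commit amounts to a date cutoff on the exported rows. In consumer-side terms the same filter looks roughly like the sketch below (the dump date and sample rows are hypothetical; only the version_downloads.csv column names follow the dump):

```python
import csv
import io
from datetime import date, timedelta

# Cutoff: the dump's generation date minus the 90-day window.
DUMP_DATE = date(2021, 5, 14)
CUTOFF = DUMP_DATE - timedelta(days=90)

downloads_csv = (
    "version_id,downloads,date\n"
    "10,3,2019-12-30\n"   # older than the window: excluded
    "11,4,2021-05-01\n"   # within the window: kept
)

# Keep only rows from the last 90 days, mirroring the export-side filter.
recent = [
    row
    for row in csv.DictReader(io.StringIO(downloads_csv))
    if date.fromisoformat(row["date"]) >= CUTOFF
]

print(recent)
```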
@17351269

This comment was marked as off-topic.

@Turbo87
Member

Turbo87 commented Jun 10, 2024
