
Data dumps feedback #2078

Closed
kornelski opened this issue Dec 30, 2019 · 8 comments · Fixed by #8748
Labels
C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Comments

@kornelski
Contributor

Overall, having full data access is great! However, the implementation could be optimized:

  1. The tarball format is problematic. It doesn't allow random access, so to extract only the interesting parts it's necessary to download and decompress most of it. A ZIP archive or individually gzipped files would be more convenient for selective consumption.

  2. crates.csv is needed to map crate names to crates.io internal IDs, so parsing this file is required to make sense of the rest of the data. However, this file also contains the bodies of README files, which are relatively big. It'd be nice to put READMEs in a separate file. It also has textsearchable_index_col, which is Postgres-specific and redundant.

  3. version_downloads.csv is the largest file, and it will keep growing. It'd be nice to shard this data by time (e.g. a separate file per year, or even per day). I would like to fetch download data daily, but I'd rather download one day of data each day, not all days every day.
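To illustrate point 1: tar stores members back to back with no index, so finding one file means scanning the archive from the start, while ZIP keeps a central directory at the end that records each member's offset, allowing a single member to be read directly. A minimal in-memory sketch (the file names and contents here are made up, not the real dump layout):

```python
import io
import tarfile
import zipfile

files = {"crates.csv": b"id,name\n1,serde\n", "readme.csv": b"big readme blob\n"}

# Build a tarball: members are concatenated with no central index, so
# locating one member requires walking the archive from the beginning.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Build a ZIP: the central directory records each member's offset, so a
# single member can be read without decompressing the others.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, mode="w") as zf:
    for name, data in files.items():
        zf.writestr(name, data)

with zipfile.ZipFile(io.BytesIO(zip_buf.getvalue())) as zf:
    crates = zf.read("crates.csv")  # direct lookup via the central directory

print(crates.decode())
```

With an HTTP range request against a remote ZIP, the same lookup needs only the central directory plus the one member, which is the selective-consumption property the tarball lacks.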

@smarnach
Contributor

smarnach commented Jan 2, 2020

Thanks a lot for the feedback!

In the first implementation of the database dumps, we planned to support one way of using the data – by importing it back into an empty Postgres database. One of the use cases we want to support is local testing with realistic data during development of crates.io itself. We intentionally chose the CSV export format instead of a format specific to Postgres to allow other uses of the data as well, but we did not put any effort into facilitating other ways of using the data. At least so far – we are happy to make improvements based on your feedback. :)

  1. It should be relatively straightforward to also provide ZIP archives, or we could switch from tarballs to ZIP altogether. So far we have advertised the dumps as "experimental", so we are free to make incompatible changes.

    Making individually gzipped files available adds considerable complexity, since we would need a way to advertise the list of files that comprise the full database dump (which may change over time). Currently we simply have a single fixed location in our S3 bucket that gets overwritten every day.

  2. The files are database dumps. Each file corresponds to a table in the database, and given the other use cases we want to support, I don't think we will fundamentally depart from this approach. However, we may mitigate the two issues you mention anyway. First, the readme column in the crates table is likely going to go away – readme files are stored per version rather than per crate. And we can probably simply disable exporting textsearchable_index_col and tell Postgres to regenerate it after import if desired.

  3. I completely agree that it is kind of ridiculous that the version_downloads table accounts for two thirds of the whole size of the dump, while it stays largely the same every day. Right now I don't have a good idea how to address this without fundamentally changing the way the exports work – maybe we need a second, completely different export format aimed at different use cases?
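One possible shape for such a second export format, matching the sharding-by-time idea from the original post: write one gzipped CSV per day, so a daily consumer fetches only the newest shard. A sketch assuming the dump's version_downloads.csv columns; the sample rows and output file names are hypothetical:

```python
import csv
import gzip
import io
from collections import defaultdict

# Sample rows standing in for version_downloads.csv.
rows = [
    {"version_id": "1", "downloads": "10", "date": "2020-01-01"},
    {"version_id": "2", "downloads": "5", "date": "2020-01-01"},
    {"version_id": "1", "downloads": "7", "date": "2020-01-02"},
]

# Group rows by date, then emit one gzipped CSV per day.
shards = defaultdict(list)
for row in rows:
    shards[row["date"]].append(row)

outputs = {}
for day, day_rows in shards.items():
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["version_id", "downloads", "date"])
    writer.writeheader()
    writer.writerows(day_rows)
    outputs[f"version_downloads-{day}.csv.gz"] = gzip.compress(buf.getvalue().encode())

print(sorted(outputs))
```

Historical shards never change, so they cache well and a consumer only ever downloads days it hasn't seen yet.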

Could you please explain a bit more in what way you want to use the data? The best way to support lib.rs would be some kind of event stream interface, I guess?

@carols10cents carols10cents added the C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works label Jan 2, 2020
@kornelski
Contributor Author

kornelski commented Jan 2, 2020

For lib.rs I'm interested only in data that is not available in .crate archives/Cargo.toml. For example, I don't need the dependencies, authors, categories, and keywords tables, because I already have that data from the primary source. I almost don't need README files, except for crates that use a ../README.md path for the README, because in that case the README file is missing from the .crate archive.

I started by extracting download numbers. Currently I use only aggregated download numbers per crate per day, and it's OK if the data is up to a week stale.
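The aggregation described here boils down to joining version_downloads to versions so that version-level counts roll up to crate-level counts per day. A sketch assuming the dump's CSV column names; the sample rows are made up:

```python
import csv
import io
from collections import defaultdict

# Sample rows standing in for versions.csv and version_downloads.csv.
versions_csv = "id,crate_id\n10,1\n11,1\n20,2\n"
downloads_csv = (
    "version_id,downloads,date\n"
    "10,3,2020-01-01\n"
    "11,4,2020-01-01\n"
    "20,9,2020-01-01\n"
)

# Map each version to its crate, then sum downloads per (crate, day).
version_to_crate = {
    row["id"]: row["crate_id"] for row in csv.DictReader(io.StringIO(versions_csv))
}
per_crate_day = defaultdict(int)
for row in csv.DictReader(io.StringIO(downloads_csv)):
    key = (version_to_crate[row["version_id"]], row["date"])
    per_crate_day[key] += int(row["downloads"])

print(dict(per_crate_day))
```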

I'm planning to use crate ownership data next. The data dump saves a lot of requests to the website API, and it has GitHub IDs (which the website API doesn't expose). However, I'll probably still fall back to the website API for newly published crates, because I want to display correct ownership on crate pages as soon as I create them. Changes in ownership would be useful as some kind of event stream, since they can happen independently of crate releases.

I'm also interested in knowing who published each version of a crate (not the Cargo.toml authors, but the GitHub account/token owner), so that I can judge which users are most active.

@smarnach
Contributor

smarnach commented Jan 2, 2020

Thanks for the details! I currently don't have much time to work on this, but I plan to do so in the next few months. I hope to get feedback from other people as well – I should probably ask on URLO.

We are currently working on an auditing table for crate owner changes. Maybe we can add some kind of event stream interface for selected database tables. This could cover ownership changes and download counts, which would address two of your points.

The information available via the API and in the database dumps is almost identical, but the database dumps publish a few additional database columns.

The information about who published a crate version is available in the crates.io database, but currently it's not made public. I'm open to publishing this data, since it could be valuable for auditing dependencies as well.

@carols10cents
Member

> The information about who published a crate version is available in the crates.io database, but currently it's not made public. I'm open to publishing this data, since it could be valuable for auditing dependencies as well.

I'm confused, versions.published_by is currently public?

@smarnach
Contributor

smarnach commented Jan 5, 2020

@carols10cents You are right – it's exposed in both the database dumps and the API (e.g. https://crates.io/api/v1/crates/rand/0.7.1 returns a published_by key). I think I got confused by the unused versions_published_by table.

@Shnatsel
Member

I'm interested in using the dumps in https://github.com/rust-secure-code/cargo-supply-chain and we're also facing issues 1 and 2 from the original post.

The data we need is just the list of owners for each crate, 3.5 MB all told. However, we have to download the entire 250 MB archive to get to that info, and download and parse a very large file with READMEs in it to get the mapping from crate names to crate IDs.
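Until the format changes, a consumer can at least avoid buffering the whole archive by reading the tarball as a stream and keeping only the members it needs, discarding the rest as it goes. A sketch using Python's `tarfile` stream mode; the member paths are assumptions, since the real dump nests its files under a dated directory:

```python
import io
import tarfile

WANTED = {"crates.csv", "crate_owners.csv"}

def extract_wanted(stream):
    """Stream a .tar.gz and keep only members whose basename we need."""
    found = {}
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:  # '|' = stream mode
        for member in tar:
            name = member.name.rsplit("/", 1)[-1]
            if name in WANTED:
                found[name] = tar.extractfile(member).read()
    return found

# Build a small in-memory dump to demonstrate; in practice the stream
# would be the HTTP response body, consumed without touching disk.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for path, data in [
        ("2020-01-01/data/crates.csv", b"id,name\n1,serde\n"),
        ("2020-01-01/data/readmes.csv", b"huge readme blob\n"),
        ("2020-01-01/data/crate_owners.csv", b"crate_id,owner_id\n1,42\n"),
    ]:
        info = tarfile.TarInfo(name=path)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

result = extract_wanted(io.BytesIO(buf.getvalue()))
print(sorted(result))
```

This still downloads and decompresses every byte (the problem from point 1), but memory use stays bounded and nothing unwanted is written out.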

bors added a commit that referenced this issue May 14, 2021

Drop crates.textsearchable_index_col from the export

This column is Postgres-specific and is normally populated via a
trigger. The trigger is now enabled during the import so that the column
can be dropped from the export.

This addresses part of what was raised in bullet point 2 of #2078. The large readme column remains because there could be people using that data, but the text search column is redundant.

r? `@smarnach`
cc `@kornelski`
bors added a commit that referenced this issue May 14, 2021

Include only the last 90 days of downloads in our database dumps

In #3479 we plan to drop old entries and archive them in some other way, so old entries will eventually disappear from the dumps anyway. This should make use of the database dumps much more practical for daily use. I think it would be reasonable to limit this even further, to the past week of data.

r? `@pietroalbini`
cc `@kornelski`, #2078
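The 90-day window in this commit amounts to a date cutoff on the exported rows. In consumer-side terms the same filter looks roughly like the sketch below (the dump date and sample rows are hypothetical; only the version_downloads.csv column names follow the dump):

```python
import csv
import io
from datetime import date, timedelta

# Cutoff: the dump's generation date minus the 90-day window.
DUMP_DATE = date(2021, 5, 14)
CUTOFF = DUMP_DATE - timedelta(days=90)

downloads_csv = (
    "version_id,downloads,date\n"
    "10,3,2019-12-30\n"   # older than the window: excluded
    "11,4,2021-05-01\n"   # within the window: kept
)

# Keep only rows from the last 90 days, mirroring the export-side filter.
recent = [
    row
    for row in csv.DictReader(io.StringIO(downloads_csv))
    if date.fromisoformat(row["date"]) >= CUTOFF
]

print(recent)
```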
@17351269

This comment was marked as off-topic.

@Turbo87
Member

Turbo87 commented Jun 10, 2024
