Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate index metadata files from the database #5066

Merged
merged 15 commits into from
Apr 21, 2023

Conversation

arlosi
Copy link
Contributor

@arlosi arlosi commented Aug 10, 2022

When a crate is modified, rather than editing the existing index file in git, this change re-generates the entire file from the database. This makes the jobs idempotent and prevents the index from getting out of sync with the database.

The delete_version and delete_crate admin tasks now also modify the index.

Test file changes are caused because the tests were inserting versions into the DB without adding them to the index.

@Turbo87 Turbo87 added C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear A-backend ⚙️ labels Aug 11, 2022
@Eh2406
Copy link
Contributor

Eh2406 commented Aug 13, 2022

There are many reasons why this is a huge step forward, thank you. (This by itself fixes a recurring problem where people download the dump of the database and are surprised that they don't have access to data that only exists in the index.)
I cannot speak to the merge-ability of this PR as I am not a maintainer of crates.io.

I do have a concern about changes to the index files, if it's more appropriate in the follow up PR then move it when it is opened. The background context: The index is a shared database, it is read by all versions of Cargo. we have made a number of improvements to make Cargo resilient to changes it does not know how to understand. But the old versions are still in use and still brittle. Specifically old enough versions (pre 1.19) will decide an entire package is unavailable if any of the versions of that package cannot be parsed. This leads to a kind of breakage that so far we have found unacceptable. Old Cargo build the project many years ago, generating a lock file, and this build succeeds. In the intervening years one of this projects dependencies publishes a new version that uses a new syntax in the index file. Old Cargo using the same old lockfile now errors because it cannot parse a version it wasn't using. Se features2 for the lengths we have gone to to avoid this breakage.

I don't think sorting various fields will cause a problem. Adding "links": null, will hopefully not cause a problem but should probably be checked, or we can use #[serde(skip_serializing_if = "Option::is_none")] to only write it where it has a value. "kind": null and "features": [""] are almost certainly OK, but also should be investigated. Just in case.

@arlosi
Copy link
Contributor Author

arlosi commented Aug 15, 2022

I re-generated the entire index using only the database and found the differences between the database and git index (excluding the normalization & sorting mentioned in the description). 22 crates had differences.

arlosi/crates.io-index@70b7763

  • Several differences on yanked status
  • A few duplicate entries removed
  • Mixup between _ and - on glib crate?
  • A few changes in deps

@Turbo87
Copy link
Member

Turbo87 commented Aug 15, 2022

interesting! that are way less changes than I would have expected :)

@bors
Copy link
Contributor

bors commented Aug 16, 2022

☔ The latest upstream changes (presumably 5fa044c) made this pull request unmergeable. Please resolve the merge conflicts.

@arlosi
Copy link
Contributor Author

arlosi commented Aug 18, 2022

@Turbo87 I broke this PR into multiple commits to make it easier to split out as needed.

@bors
Copy link
Contributor

bors commented Aug 20, 2022

☔ The latest upstream changes (presumably 17fef4c) made this pull request unmergeable. Please resolve the merge conflicts.

src/admin/import_cksum.rs Outdated Show resolved Hide resolved
@Turbo87
Copy link
Member

Turbo87 commented Aug 22, 2022

@arlosi btw once we've backfilled the index field in the database I guess it would also be good to expose the new fields in the APi too :)

@bors
Copy link
Contributor

bors commented Aug 22, 2022

☔ The latest upstream changes (presumably 98ea30e) made this pull request unmergeable. Please resolve the merge conflicts.

Copy link
Member

@jtgeibel jtgeibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Just 2 minor comments, one of which you've already split off into another PR (so feel free to tweak it here, or just ignore).

src/admin/import_cksum.rs Outdated Show resolved Hide resolved
src/models/dependency.rs Outdated Show resolved Hide resolved
@arlosi arlosi force-pushed the db-index branch 2 times, most recently from b068a80 to b78d32e Compare August 24, 2022 21:02
@arlosi
Copy link
Contributor Author

arlosi commented Aug 24, 2022

@Eh2406 Is there a package & recommended pre-1.19 Cargo that I should test this with?

I have https://github.com/arlosi/crates.io-index set up as a database-regenerated index from last-night's DB dump, so it should be straightforward to validate with old cargo and a source-replacement of crates.io

@arlosi arlosi changed the title Add additional fields such that the index can be generated from the database Generate index metadata files from the database Aug 25, 2022
@arlosi
Copy link
Contributor Author

arlosi commented Aug 25, 2022

Crates versions published before Feb 2018 omit the links field, while crates published after have links: null, so Cargo clearly supports both forms.

When there are no links, should we normalize to links: null, or omit the field? cc @Eh2406

The change is trivial: (adding #[serde(skip_serializing_if = "Option::is_none")])

@Eh2406
Copy link
Contributor

Eh2406 commented Aug 25, 2022

I'm convinced that it won't cause a problem. Thanks for the clear proof.
If it does not matter, we may as well ship less data. Lets omit.

@Turbo87
Copy link
Member

Turbo87 commented Aug 25, 2022

git-import ran through without any errors this time. database should now be up-to-date unless we missed something in the import script :)

pietroalbini pushed a commit to rust-lang/crates.io-index that referenced this pull request Jan 13, 2023
More information can be found at rust-lang/crates.io#5066
pietroalbini pushed a commit to rust-lang/crates.io-index that referenced this pull request Jan 13, 2023
More information can be found at rust-lang/crates.io#5066
@arlosi arlosi force-pushed the db-index branch 3 times, most recently from 33a48ab to cabdc88 Compare April 18, 2023 19:41
@Turbo87
Copy link
Member

Turbo87 commented Apr 21, 2023

As mentioned on Zulip, I've taken a look at the proposed changes and implemented some improvements on top:

  • the index_metadata() method now returns an unserialized Vec<Crate>, which makes it easier to test via insta
  • the cargo-registry-index crate contains the necessary code to serialize a Vec<Crate> to an index file
  • the index_metadata() method sorts by created_at (for production) before falling back to semver sorting (for tests, because all versions in a test db transaction will unfortunately end up with the same created_at)
  • the index_metadata() method will fetch all dependencies of all relevant versions in a single database query instead of running one query per version
  • the sync background job has been split up into one job per index (git and sparse). without this, we could end up in a situation where the sparse index update fails, causing the job to be retried, but the git index check will see that both new and old file content already match and skip the sparse index update too.
  • I've added a FEATURE_INDEX_SYNC environment variable to turn the new behavior on and off, which should allow us to merge this but only enable it on staging for now
  • the above required adding new background jobs instead of changing the existing ones
  • finally, I've split the work up into a couple more commits to make it a bit easier to review :)

@arlosi I hope you're okay with these changes. Would be great to get a review from you before I merge this :)

Turbo87 and others added 4 commits April 21, 2023 18:42
…RE_INDEX_SYNC` env var is set

This ensures that we not only remove the version from the database, but also from the index.
…_INDEX_SYNC` env var is set

This ensures that the crate is deleted from both the git index and the sparse index, and that the CloudFront invalidation is correctly sent out.
Copy link
Contributor Author

@arlosi arlosi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your changes look good to me. We'll want to do a follow up PR to clean up the old code path once it's been running successfully on prod for a while.

When we enable this on prod:

  • This will normalize any crates that aren't already normalized if they're changed (new version, yank/unyank).
    • Currently this is around 9000 crates
    • I think this should be fine. It will spread the normalization out over time and avoid any large normalization commit. The number of denormalized crates will shrink over time and we can do a bulk normalization to get the remainder in a few months.
  • It will resolve the dependency index/db differences mentioned in the Zulip thread if they're changed.
    • We should merge my commit that resolves these before enabling this on prod

(Some(old), Some(new)) if old != new => {
let mut file = File::create(&dst)?;
file.write_all(new.as_bytes())?;
repo.commit_and_push(&format!("Updating crate `{}`", krate), &dst)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will produce less specific git commit messages than we currently use.

It's probably fine (and it makes the code simpler), but I wanted to make sure we're OK with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will produce less specific git commit messages than we currently use.

Yep, but I figured since we can't be sure if we only push a single change or not we may as well use a more generic message.

src/admin/delete_crate.rs Outdated Show resolved Hide resolved
@Turbo87 Turbo87 enabled auto-merge April 21, 2023 19:33
@Turbo87 Turbo87 linked an issue Apr 21, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear
Projects
None yet
Development

Successfully merging this pull request may close these issues.

embedded-hal-nb v1.0.0-alpha.2 missing from index.
8 participants