Generate index metadata files from the database #5066
There are many reasons why this is a huge step forward, thank you. (This by itself fixes a recurring problem where people download the dump of the database and are surprised that they don't have access to data that only exists in the index.)

I do have a concern about changes to the index files. If it's more appropriate in the follow-up PR, then move it when it is opened.

The background context: The index is a shared database; it is read by all versions of Cargo. We have made a number of improvements to make Cargo resilient to changes it does not know how to understand, but the old versions are still in use and still brittle. Specifically, old enough versions (pre 1.19) will decide an entire package is unavailable if any of the versions of that package cannot be parsed. This leads to a kind of breakage that so far we have found unacceptable: an old Cargo built the project many years ago, generating a lock file, and this build succeeded. In the intervening years, one of this project's dependencies publishes a new version that uses a new syntax in the index file. The old Cargo, using the same old lockfile, now errors because it cannot parse a version it wasn't using. Se

I don't think sorting various fields will cause a problem. Adding |
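To make the failure mode described above concrete, here is a minimal sketch contrasting a brittle parser with a resilient one. It is not Cargo's actual code: the `name version` line format, the `Entry` type, and the function names are all hypothetical simplifications (real index lines are JSON).

```rust
/// Toy index entry. The real index stores one JSON object per line;
/// this two-field struct is an assumption made for illustration.
#[derive(Debug, PartialEq)]
struct Entry {
    name: String,
    vers: String,
}

/// Parse one line of the toy `name version` format.
/// Any extra token stands in for "new syntax the parser does not know".
fn parse_line(line: &str) -> Result<Entry, String> {
    let mut parts = line.split_whitespace();
    match (parts.next(), parts.next(), parts.next()) {
        (Some(name), Some(vers), None) => Ok(Entry {
            name: name.to_string(),
            vers: vers.to_string(),
        }),
        _ => Err(format!("unrecognized index line: {:?}", line)),
    }
}

/// Brittle behaviour: a single unparseable line fails the whole package.
fn parse_strict(file: &str) -> Result<Vec<Entry>, String> {
    file.lines().map(parse_line).collect()
}

/// Resilient behaviour: lines that cannot be understood are skipped.
fn parse_lenient(file: &str) -> Vec<Entry> {
    file.lines().filter_map(|l| parse_line(l).ok()).collect()
}

fn main() {
    // The second version uses syntax the toy parser does not understand.
    let file = "foo 1.0.0\nfoo 1.1.0 some-new-syntax\n";
    println!("strict:  {:?}", parse_strict(file));
    println!("lenient: {:?}", parse_lenient(file));
}
```

With the strict variant, a lockfile pinned to `foo 1.0.0` would stop resolving as soon as the second line appears, even though that version was never used.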
I re-generated the entire index using only the database and found the differences between the database and git index (excluding the normalization & sorting mentioned in the description). 22 crates had differences. arlosi/crates.io-index@70b7763
Interesting! That's way fewer changes than I would have expected :) |
☔ The latest upstream changes (presumably 5fa044c) made this pull request unmergeable. Please resolve the merge conflicts. |
2ea19db to 8ad38da
@Turbo87 I broke this PR into multiple commits to make it easier to split out as needed. |
☔ The latest upstream changes (presumably 17fef4c) made this pull request unmergeable. Please resolve the merge conflicts. |
@arlosi btw once we've backfilled the index field in the database I guess it would also be good to expose the new fields in the API too :) |
☔ The latest upstream changes (presumably 98ea30e) made this pull request unmergeable. Please resolve the merge conflicts. |
Looks good to me. Just 2 minor comments, one of which you've already split off into another PR (so feel free to tweak it here, or just ignore).
b068a80 to b78d32e
@Eh2406 Is there a package & recommended pre-1.19 Cargo that I should test this with? I have |
Crates versions published before Feb 2018 omit the When there are no links, should we normalize to The change is trivial: (adding |
I'm convinced that it won't cause a problem. Thanks for the clear proof. |
More information can be found at rust-lang/crates.io#5066
33a48ab to cabdc88
Let's leave the serialization to the `cargo-registry-index` crate instead.
As mentioned on Zulip, I've taken a look at the proposed changes and implemented some improvements on top:
@arlosi I hope you're okay with these changes. Would be great to get a review from you before I merge this :) |
…` env var is set
…_SYNC` env var is set
…RE_INDEX_SYNC` env var is set This ensures that we not only remove the version from the database, but also from the index.
…_INDEX_SYNC` env var is set This ensures that the crate is deleted from both the git index and the sparse index, and that the CloudFront invalidation is correctly sent out.
…_INDEX_SYNC` env var is set
Your changes look good to me. We'll want to do a follow-up PR to clean up the old code path once it's been running successfully on prod for a while.
When we enable this on prod:
- This will normalize any crates that aren't already normalized if they're changed (new version, yank/unyank).
  - Currently this is around 9000 crates.
  - I think this should be fine. It will spread the normalization out over time and avoid any large normalization commit. The number of denormalized crates will shrink over time and we can do a bulk normalization to get the remainder in a few months.
- It will resolve the dependency index/db differences mentioned in the Zulip thread if they're changed.
  - We should merge my commit that resolves these before enabling this on prod.
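As a rough illustration of "normalize on change", the sketch below treats normalization as sorting the list fields of an entry and counts how many entries are still denormalized. The `Entry` fields and all function names here are hypothetical; the actual normalization rules are the ones in the PR description, not reproduced here.

```rust
/// Hypothetical, simplified index entry. Field names are assumptions
/// made for illustration, not the real index schema.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    vers: String,
    deps: Vec<String>,
    features: Vec<String>,
}

/// Return a canonical copy of the entry with its list fields sorted.
fn normalize(entry: &Entry) -> Entry {
    let mut out = entry.clone();
    out.deps.sort();
    out.features.sort();
    out
}

/// An entry is normalized if normalizing it changes nothing.
fn is_normalized(entry: &Entry) -> bool {
    normalize(entry) == *entry
}

/// Count the entries that would still be rewritten by a bulk normalization.
fn count_denormalized(entries: &[Entry]) -> usize {
    entries.iter().filter(|e| !is_normalized(e)).count()
}

fn main() {
    let messy = Entry {
        vers: "1.0.0".to_string(),
        deps: vec!["serde".to_string(), "libc".to_string()],
        features: vec!["std".to_string(), "alloc".to_string()],
    };
    let clean = normalize(&messy);
    println!("version {} normalized: {}", clean.vers, is_normalized(&clean));
    println!("remaining: {}", count_denormalized(&[messy, clean]));
}
```

Because `normalize` is idempotent, an entry that has been rewritten once never shows up as a difference again, which is why the pending count can only shrink as crates are touched.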
(Some(old), Some(new)) if old != new => {
    let mut file = File::create(&dst)?;
    file.write_all(new.as_bytes())?;
    repo.commit_and_push(&format!("Updating crate `{}`", krate), &dst)?;
This will produce less specific git commit messages than we currently use.
It's probably fine (and it makes the code simpler), but I wanted to make sure we're OK with that.
> This will produce less specific git commit messages than we currently use.
Yep, but I figured since we can't be sure if we only push a single change or not we may as well use a more generic message.
When a crate is modified, rather than editing the existing index file in git, this change re-generates the entire file from the database. This makes the jobs idempotent and prevents the index from getting out of sync with the database.
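The regenerate-and-compare approach can be sketched as follows. This is a simplified model, not the code in this PR: the in-memory `Db` and `Index` maps, the `render` and `sync_crate` functions, the `Action` enum, and the `name version` line format are all hypothetical stand-ins for the database, the git/sparse index, and the real JSON serialization.

```rust
use std::collections::BTreeMap;

/// What a sync run did to the index file of one crate.
#[derive(Debug, PartialEq)]
enum Action {
    Created,
    Updated,
    Deleted,
    Unchanged,
}

/// Stand-in for the database: crate name -> published version strings.
type Db = BTreeMap<String, Vec<String>>;
/// Stand-in for the index: crate name -> full file contents.
type Index = BTreeMap<String, String>;

/// Regenerate the whole index file for a crate from the database.
/// Returns `None` if the crate has no versions (toy format, sorted as strings).
fn render(db: &Db, krate: &str) -> Option<String> {
    let versions = db.get(krate)?;
    if versions.is_empty() {
        return None;
    }
    let mut sorted = versions.clone();
    sorted.sort();
    Some(sorted.iter().map(|v| format!("{} {}\n", krate, v)).collect())
}

/// Bring the index file in line with the database, touching it only on change.
fn sync_crate(db: &Db, index: &mut Index, krate: &str) -> Action {
    let new = render(db, krate);
    let old = index.get(krate).cloned();
    match (old, new) {
        (None, Some(new)) => {
            index.insert(krate.to_string(), new);
            Action::Created
        }
        (Some(old), Some(new)) if old != new => {
            index.insert(krate.to_string(), new);
            Action::Updated
        }
        (Some(_), None) => {
            index.remove(krate);
            Action::Deleted
        }
        _ => Action::Unchanged,
    }
}

fn main() {
    let mut db = Db::new();
    let mut index = Index::new();
    db.insert("foo".to_string(), vec!["1.0.0".to_string()]);
    println!("{:?}", sync_crate(&db, &mut index, "foo"));
    // Running the same job again is a no-op, which is what makes it idempotent.
    println!("{:?}", sync_crate(&db, &mut index, "foo"));
}
```

Since the output depends only on the current database state, a retried or duplicated job converges on the same file instead of applying the same edit twice.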
The `delete_version` and `delete_crate` admin tasks now also modify the index.

Test file changes are caused because the tests were inserting versions into the DB without adding them to the index.