Parse less JSON on null builds #6880

alexcrichton · 2019-04-26T01:07:27Z

This commit fixes a performance pathology in Cargo today. Whenever Cargo
generates a lock file (which happens on all invocations of cargo build
for example) Cargo will parse the crates.io index to learn about
dependencies. Currently, however, when it parses a crate it parses the
JSON blob for every single version of the crate. With a lock file,
however, or with incremental builds only one of these lines of JSON is
relevant. Measured today Cargo building Cargo parses 3700 JSON
dependencies in the registry.

This commit implements an optimization that brings down the number of
parsed JSON lines in the registry to precisely the right number
necessary to build a project. For example Cargo has 150 crates in its
lock file, so now it only parses 150 JSON lines (a 20x reduction from
3700). This in turn can greatly improve Cargo's null build time. Cargo
building Cargo dropped from 120ms to 60ms on a Linux machine and 400ms
to 200ms on a Mac.

The commit internally has a lot more details about how this is done but
the general idea is to have a cache which is optimized for Cargo to read
which is maintained automatically by Cargo.
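
A minimal sketch of the general idea, not Cargo's actual cache format: assume the cache keeps each version string next to its raw, unparsed JSON line, so a lookup compares plain strings and runs the JSON parser only on the one line the lock file asks for. The names `build_cache` and `lookup` are hypothetical; the `name`, `vers`, and `deps` fields follow the index line format.

```python
import json

def build_cache(index_lines):
    """Parse every index line once and keep (version, raw JSON) pairs."""
    cache = []
    for line in index_lines:
        entry = json.loads(line)  # the expensive full parse, done once
        cache.append((entry["vers"], line))
    return cache

def lookup(cache, wanted_version):
    """Parse only the JSON blob whose version matches the lock file."""
    for version, raw in cache:
        if version == wanted_version:
            return json.loads(raw)
    return None

index_lines = [
    '{"name": "hex", "vers": "0.3.1", "deps": []}',
    '{"name": "hex", "vers": "0.3.2", "deps": []}',
]
cache = build_cache(index_lines)
entry = lookup(cache, "0.3.2")  # one JSON parse instead of one per version
```

With a lock file naming one version per crate, the number of `json.loads` calls on the hot path equals the number of locked crates rather than the number of published versions.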

Closes #6866

rust-highfive · 2019-04-26T01:07:30Z

r? @Eh2406

(rust_highfive has picked a reviewer for you, use r? to override)

alexcrichton · 2019-04-26T01:08:03Z

This isn't 100% ready to go yet since it still doesn't handle concurrent writes into the global cache, but I wanted to put this up for initial thoughts if there were any! I hope to fix the global cache write synchronization tomorrow

Eh2406 · 2019-04-26T03:52:32Z

This is really cool! I will have to review it when I am more awake. Overall I wonder how we can test this well. How can we be absolutely sure that there are no race conditions, or other bugs, leading to things getting out of sync? Now and in the future? Similarly for making sure that the files work across Cargo versions.

alexcrichton · 2019-04-26T15:52:16Z

Heh that's a good question! I don't think we can ever really be sure that race conditions/bugs related to that are gone. I think it's largely about how we architect the locking and whether, on evaluation, it feels as foolproof as possible. I had an idea this morning that I'm going to toy with and that I'm pretty confident in, but really we can only get so far here.

In terms of working with Cargo against future versions it's sort of the same, I'm trying to be very liberal with ignoring errors and proactive with some degree of versioning, but in reality there's really only so much we can do against this I think.

alexcrichton · 2019-04-26T18:53:48Z

Ok I've updated with a strategy to lock the index and ensure that concurrent updates work ok, even with this new caching strategy. The new locking strategy is to basically just not have granular locks and instead have one large global lock protecting all of resolve, for example. This is done to avoid us having to worry about all these concurrent updates, and it in theory isn't any loss in functionality either.
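
As a rough illustration of the coarse-lock idea (a sketch only, not Cargo's implementation): one exclusive file lock is taken around the whole operation, so nothing inside it needs finer-grained locking. It assumes a Unix system where `fcntl.flock` is available; the lock file name and the `resolve` function are placeholders.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def package_cache_lock(cache_dir):
    """Hold one exclusive lock for the whole cache directory (Unix only)."""
    path = os.path.join(cache_dir, ".package-cache")  # name chosen for illustration
    with open(path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another process holds it
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

def resolve(cache_dir):
    # Everything that reads or writes the cache runs under the same lock,
    # so concurrent updates never interleave.
    with package_cache_lock(cache_dir):
        return "resolved"

cache_dir = tempfile.mkdtemp()
result = resolve(cache_dir)
```

The trade-off is that two invocations serialize on the whole step instead of on individual files, which is what makes the concurrency story easy to reason about.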

alexcrichton · 2019-04-26T18:54:03Z

@ehuss you might be interested in the locking commit as well

Eh2406 · 2019-04-26T21:10:41Z

I did not grok this yet, but some thoughts:

Can we make CURRENT_CACHE_VERSION part of the file path so we reduce the chance of cross talk? (Or has this already been done?) Actually, where are the cache files stored?
Should we have a debug assertion that the cached version matches the canonical one?
Can we have a test that the Cargo that was used to build the tests is compatible with the Cargo being tested? (This would be good for lock files as well.) Set up a registry, have the host Cargo make its cache files, have the test Cargo run to see if it correctly uses or ignores the files, then have the host Cargo run to make sure we don't break.
Can we have a test that the host Cargo and the test Cargo can run concurrently without messing up each other's locks? BTW what happens if a new Cargo gets a new coarse lock while an old Cargo has a granular lock?
Given that the registry is a Git project, can we use the hash of HEAD instead of mtime? (mtime often does weird things.)

alexcrichton · 2019-04-29T15:12:49Z

I'm personally not really super concerned about cross-cargo-version issues here. I think we need to at a bare minimum ensure it works (nothing gets corrupted across Cargo invocations), but other than that I feel like it's a bit much for us to maintain anything else. For example I don't think we need to optimize for the use case where you oscillate between Cargo versions and it might thrash the cache that we're building here. The purpose of this PR is to reduce the overhead of Cargo as much as possible on incremental builds, and part of the incremental aspect is not changing Cargo that much!

In that sense I could make the version part of the path for sure and we could reduce cache thrashing, but I don't think it's too too important here. Additionally while it's happened to work in the past I don't think we should strive to say that concurrent invocations of different versions of Cargo are supposed to work (rather only concurrent invocations of the same version are expected to work).

I like the idea of a debug assertion and using the git hash instead of the mtime, I'll look to implement those later when this is closer to being good to go!

For testing, I'm not sure how we'd manage that unfortunately. We can't really rely on the host Cargo to be any particular version so the tests would already have to be really loose. It may be best to just unit-test the code in question and make sure that error handling is as conservative as possible, since I'm not sure how to best test these things (but I think it's pretty minor)

bors · 2019-04-29T15:19:09Z

☔ The latest upstream changes (presumably #6871) made this pull request unmergeable. Please resolve the merge conflicts.

Eh2406 · 2019-04-30T03:01:57Z

src/cargo/core/mod.rs

@@ -14,6 +14,7 @@ pub use self::shell::{Shell, Verbosity};
 pub use self::source::{GitReference, Source, SourceId, SourceMap};
 pub use self::summary::{FeatureMap, FeatureValue, Summary};
 pub use self::workspace::{Members, Workspace, WorkspaceConfig, WorkspaceRootConfig};
+pub use self::interning::InternedString;


There are a lot of Cow<'_, str> that can be replaced with InternedString now that it is public.

Yeah that was one thing I was going to try if JSON parsing still showed up in the profile, but after this PR the JSON parsing disappeared so I think it's less pressing to do that just yet (can be a follow-up of course!)

Eh2406 · 2019-04-30T03:14:10Z

I did a more in-depth review; it looks good, and I really like it!
Three fundamental questions:

1. Why do the processing on the client side, why not have the index just write the better format in the first place? (Presumably this is a smaller change, given that custom registries are now stable.)
2. If we are building a custom format why stay with JSON? (Presumably we can always experiment with changing this after this lands, so start with the smallest change.)
3. Lots of small files, how does this perform on Windows... (Presumably you want me to answer that.)

alexcrichton · 2019-04-30T14:25:33Z

Good questions!

> Why do the processing on the client side, why not have the index just write the better format in the first place? (Presumably this is a smaller change, given that custom registries are now stable.)

For doing this on the client side rather than the index itself, I think the main reason is feasibility. I've long figured that the index will not satisfy Cargo until the end of time and we'd need an even faster indexing format at some point. Having the capability of a local cache managed by Cargo allows us to divorce these two aspects. The index is primarily focused on optimizing for delta updates (making it super easy and fast to incrementally update it), whereas Cargo's problem is different: it effectively requires random access to the index. I think it's fine if this is the only change we make to Cargo's internal format for a long time to come, but otherwise I think it's inevitable that Cargo's own format for the index on disk is divorced from the index's upstream format.

Frankly though for the index format it's just way easier to do it in Cargo. The amount of effort needed to change the index itself and make sure everything doesn't break means that this probably wouldn't get done.

> If we are building a custom format why stay with JSON? (Presumably we can always experiment with changing this after this lands, so start with the smallest change.)

Heh another good question! I wasn't sure whether JSON would still be slow, so this is where I let the profiles guide me. It was obvious from before that parsing thousands of lines of JSON took hundreds of milliseconds and the easiest win was to simply not parse thousands of lines but only the handful needed. Since then I haven't seen JSON parsing in the profile.

We could of course, however, change the format of the cache files at any time. That's the point of the local cache for Cargo :). If even JSON is too slow we'll have to carefully design a new format to ensure it preserves all the relevant information, but it's certainly possible to do so.

> Lots of small files, how does this perform on Windows...

Ah yeah unfortunately I don't have access to my Windows machine right now to test this out. Previously we were reading one big git file but now we're reading a lot of little files around the filesystem, so I'm honestly not entirely sure what the performance is. It's a good point though and something we should measure before landing. Would you be up for helping me out with measurements?

Eh2406 · 2019-04-30T19:12:50Z

I ran a number of commands with two versions of cargo. One is from the head of this PR (25ee430d41cb48be76d31747847e007db614b234), the other from after the small optimizations (319e9bb09bdb8003310ed756ac627235bf46af1a).
Both are rebased on master (af1fcb3). Both were built locally with --release from rustc 1.36.0-nightly (6d599337f 2019-04-22). The script ran each combination once outside the timing loop, then calculated the wall time to run the command 15 times in the Cargo project.

import subprocess
import time

N = 15  # timed runs per (command, cargo) combination

for command in [["update"], "update -p hex".split(), ["generate-lockfile"], ["build"]]:
    for cargo in ["PR", "Master"]:
        # Build the argument list once so the warm-up and timed runs match.
        argv = ['speed/cargo-' + cargo] + command + ["-Zno-index-update"]
        # Warm-up run outside the timing loop.
        subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
        start = time.perf_counter()  # wall-clock timer
        for i in range(N):
            p1 = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            p1.communicate()
        print(command, cargo, (time.perf_counter() - start) / N)

The table below shows the average wall time in seconds for each combination.

Edit: looks like you need to update the index to get the new speeds. So here are the numbers with the same artifacts and commands, after I ran a command once without "-Zno-index-update".

Edit: the command for update -p hex was malformed.

| command | Master (s) | PR (s) | % change |
| --- | --- | --- | --- |
| update | 0.27955078 | 0.195570633333 | 30% |
| update -p hex | 0.266721386667 | 0.151231233333 | 43% |
| generate-lockfile | 0.266229426667 | 0.171039146667 | 35% |
| build | 0.714220346667 | 0.711805573333 | 0.3% |
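
The change column appears to be the relative reduction in wall time compared with Master; that reading is an assumption, since the thread does not give the formula. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def percent_faster(master_seconds, pr_seconds):
    """Relative reduction in wall time, as a percentage of the Master time."""
    return (master_seconds - pr_seconds) / master_seconds * 100.0

# Timings copied from the rows above.
update_change = percent_faster(0.27955078, 0.195570633333)
build_change = percent_faster(0.714220346667, 0.711805573333)
```

Under this reading, the commands that only resolve (`update`, `generate-lockfile`) gain the most, while `build` is dominated by work outside the registry index.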

alexcrichton · 2019-04-30T21:41:00Z

Ok I've rebased and pushed up a commit which uses git sha information instead of mtime information which should be more robust, as well as an additional commit which adds a debug assertion that if we think the cache is fresh it actually is.
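
A hedged sketch of how such a freshness check could look; this is an assumed layout, not the format Cargo writes. The cache is prefixed with a format version and the index revision it was generated from, anything that does not match is ignored, and a debug-only assertion cross-checks a cache that claims to be fresh. The value of `CURRENT_CACHE_VERSION` and all function names are hypothetical.

```python
CURRENT_CACHE_VERSION = 1  # hypothetical value

def write_cache(index_revision, body):
    """Prefix the cached data with the cache version and index revision."""
    header = "{}\0{}\0".format(CURRENT_CACHE_VERSION, index_revision)
    return header.encode("utf-8") + body

def read_cache(cache_bytes, index_revision):
    """Return the cached body, or None if the cache is stale or unreadable."""
    parts = cache_bytes.split(b"\0", 2)
    if len(parts) != 3:
        return None  # corrupt cache: ignore it rather than fail
    version, revision, body = parts
    if version != str(CURRENT_CACHE_VERSION).encode("utf-8"):
        return None  # written with a different cache format
    if revision != index_revision.encode("utf-8"):
        return None  # the index has moved on since the cache was written
    return body

def load(cache_bytes, index_revision, regenerate):
    """Use the cache when fresh, otherwise rebuild from the index."""
    body = read_cache(cache_bytes, index_revision)
    if body is None:
        return regenerate()
    if __debug__:
        # Debug-only check that a cache we consider fresh really is fresh.
        assert body == regenerate(), "fresh cache disagrees with the index"
    return body

cached = write_cache("abc123", b"data")
```

Keying on the revision rather than a timestamp means the cache is invalidated exactly when the index contents change.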

src/cargo/sources/registry/index.rs

Eh2406 · 2019-05-01T00:40:08Z

New timings are:

| command | Master (s) | PR (s) | % change |
| --- | --- | --- | --- |
| update | 0.26499556 | 0.183760626667 | 30.6% |
| update -p hex | 0.247472593333 | 0.13809348 | 44.2% |
| generate-lockfile | 0.249164073333 | 0.16347192 | 34.4% |
| build | 0.6984459 | 0.69787098 | 0.1% |

alexcrichton · 2019-05-01T14:32:25Z

Updated!

Eh2406 · 2019-05-01T14:51:04Z

Ok this looks good! Other improvements can always be done in follow up PRs.

I keep getting distracted by changes that may be small improvements, when according to my profiling the only thing that matters is a design that uses fewer files.

Eh2406 · 2019-05-01T21:50:36Z

So here is a fleshed-out version of my 3-file straw man.

index.json is an uncompressed, concatenated version of all of the index files we have actually read, in the existing format.
versions.lookup is a fast way to look up where in index.json the row is for each version. It would have the format version-string\0start\0end\n.
names.lookup is a fast way to look up where in versions.lookup the rows are for each name. It would have the format name\0start\0end\n.

To do a search:

Read in and parse all of names.lookup. If the name we want is not in there, then read it from the raw index and append to the files.
Read in all of versions.lookup, but only parse the part of the file that names.lookup told us to.
Read in all of index.json, but only parse the part of the file that versions.lookup told us to.

When we pull a new version of the index, delete the files.

"Read in all of" can probably be mmap, if that is a bottleneck.

Also we should look at whether there are existing libraries for on-disk indexing into a file... https://crates.io/crates/csv-index maybe? Actually that does something very similar.
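
The straw man above can be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: the three "files" are kept as in-memory byte strings, `start` and `end` are taken to be byte offsets, and the helper names are made up. Appending on a miss and deleting on index update are left out.

```python
import json

def build(index):
    """index maps crate name -> list of raw JSON lines (one per version)."""
    index_json, versions_lookup, names_lookup = b"", b"", b""
    for name, lines in index.items():
        versions_start = len(versions_lookup)
        for line in lines:
            raw = line.encode("utf-8")
            start = len(index_json)
            index_json += raw + b"\n"
            vers = json.loads(line)["vers"]
            row = "{}\0{}\0{}\n".format(vers, start, start + len(raw))
            versions_lookup += row.encode("utf-8")
        row = "{}\0{}\0{}\n".format(name, versions_start, len(versions_lookup))
        names_lookup += row.encode("utf-8")
    return index_json, versions_lookup, names_lookup

def parse_rows(blob):
    """Parse NUL-separated (key, start, end) rows into {key: (start, end)}."""
    rows = {}
    for row in blob.split(b"\n"):
        if row:
            key, start, end = row.split(b"\0")
            rows[key.decode("utf-8")] = (int(start), int(end))
    return rows

def search(files, name, vers):
    index_json, versions_lookup, names_lookup = files
    names = parse_rows(names_lookup)  # step 1: parse all of names.lookup
    if name not in names:
        return None  # a fuller version would fall back to the raw index here
    v_start, v_end = names[name]
    versions = parse_rows(versions_lookup[v_start:v_end])  # step 2: one slice
    if vers not in versions:
        return None
    start, end = versions[vers]
    return json.loads(index_json[start:end])  # step 3: one JSON line

files = build({
    "hex": ['{"name": "hex", "vers": "0.3.1"}', '{"name": "hex", "vers": "0.3.2"}'],
    "log": ['{"name": "log", "vers": "0.4.6"}'],
})
found = search(files, "hex", "0.3.2")
```

Only `names.lookup` is parsed in full; the other two files are sliced by offset, which is what would let them be memory-mapped.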

bors · 2019-05-01T22:56:13Z

☔ The latest upstream changes (presumably #6896) made this pull request unmergeable. Please resolve the merge conflicts.

This was currently getting executed on all builds, even if the directory already exists. There shouldn't be any reason though to exclude the directory from backups on all builds, and after seeing this get a stack sample in a profile I figured it's best to ensure it only executes once in case the backing system implementation isn't the speediest.


alexcrichton · 2019-05-06T15:41:30Z

I've opened #6908 to track your suggestion @Eh2406

Update cargo

12 commits in beb8fcb5248dc2e6aa488af9613216d5ccb31c6a..759b6161a328db1d4863139e90875308ecd25a75
2019-04-30 23:58:00 +0000 to 2019-05-06 20:47:49 +0000
- Small things (rust-lang/cargo#6910)
- Fix skipping over invalid registry packages (rust-lang/cargo#6912)
- Fixes rust-lang/cargo#6874 (rust-lang/cargo#6905)
- doc: Format examples of version to ease reading (rust-lang/cargo#6907)
- fix more typos (codespell) (rust-lang/cargo#6903)
- Parse less JSON on null builds (rust-lang/cargo#6880)
- chore: Update opener to 0.4 (rust-lang/cargo#6902)
- Update documentation for auto-discovery. (rust-lang/cargo#6898)
- Update some doc links. (rust-lang/cargo#6897)
- Default Cargo.toml template provide help for completing the metadata (rust-lang/cargo#6881)
- Run 'cargo fmt --all' (rust-lang/cargo#6896)
- Refactor command definition (rust-lang/cargo#6894)

Re-enable compatibility with readonly CARGO_HOME

Previously Cargo would attempt to work as much as possible with a previously filled out CARGO_HOME, even if it was mounted as read-only. In #6880 this was regressed as a few global locks and files were always attempted to be opened in writable mode. This commit fixes these issues by correcting two locations:

* First the global package cache lock has error handling to allow acquiring the lock in read-only mode in addition to read/write mode. If the read/write mode failed due to an error that looks like a readonly filesystem then we assume everything in the package cache is readonly and we switch to just acquiring any lock, this time a shared readonly one. We in theory aren't actually doing any synchronization at that point since it's all readonly anyway.
* Next when unpacking a package we're careful to issue a `stat` call before opening a file in writable mode. This way our preexisting guard to return early if a package is unpacked will succeed before we open anything in writable mode.

Closes #6928
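
The fallback described in the first bullet might look roughly like this. It is a sketch under assumptions: the two opener callables and the set of errno values treated as "looks like a readonly filesystem" are illustrative choices, not taken from the commit.

```python
import errno

# Assumed set of errors that suggest the cache lives on a read-only mount.
READONLY_ERRNOS = (errno.EROFS, errno.EACCES)

def acquire_package_cache_lock(open_read_write, open_read_only):
    """Try an exclusive read/write lock first, then fall back to a shared one.

    Both arguments are hypothetical callables that open and lock the lock
    file, raising OSError on failure.
    """
    try:
        return open_read_write(), "exclusive"
    except OSError as err:
        if err.errno not in READONLY_ERRNOS:
            raise  # unrelated failure: do not hide it
        # Everything is assumed read-only, so a shared lock is enough.
        return open_read_only(), "shared"

def failing_rw():
    raise OSError(errno.EROFS, "Read-only file system")

handle, mode = acquire_package_cache_lock(failing_rw, lambda: "ro-handle")
```

Re-raising errors outside the read-only set keeps real failures, such as a missing directory, visible to the caller.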

Fix excluding target dirs from backups on OSX

This fixes an accidental regression from #6880 identified in #7189 by moving where the configuration of backup preferences happens since it was accidentally never happening due to the folder always having been created.

Closes #7189

don't need to copy this string

This removes a `String::clone` that I noticed when profiling no-op builds of cargo; benchmarks show a barely visible improvement. Looks like it was added in #6880, but I am not sure why.

`map_dependencies` is doing a deep clone, so lets make it cheaper

This removes a `FeatureMap::clone` that I noticed when profiling no-op builds of cargo; benchmarks show a ~5% improvement. Looks like #6880 means that there is a ref to every `Summary`, so the `Rc::make_mut` does a deep clone.

Remove the `git-checkout` subcommand.

This command has been broken for almost a year (since #6880), and nobody has mentioned it. The command isn't very useful (it checks out into cargo's `db` directory, which can also be accomplished with `cargo fetch`). Since it doesn't have much utility, I don't see much reason to keep it around.

### What does this PR try to resolve?

While #14897 reported packages with an unsupported index schema version, that only worked if the changes in the schema version did not cause errors in deserializing `IndexPackage` or in generating a `Summary`.

This extends that change by recovering on error with a more lax, incomplete parse of `IndexPackage` which should always generate a valid `Summary`.

To help with a buggy Index, we also will report as many as we can. This does not provide a way to report to users or log on cache reads if the index entry is not at least `{"name": "<string>", "vers": "<semver>"}`.

Fixes #10623
Fixes #14894

### How should we test and review this PR?

My biggest paranoia is some bad interaction with the index cache including more "invalid" index entries. That should be ok as we will ignore the invalid entry in the cache when loading it. Ignoring of invalid entries dates back to #6880 when the index cache was introduced.

### Additional information
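
A loose sketch of the strict-then-lax recovery described above, written in Python for illustration rather than mirroring the actual `IndexPackage` deserialization. The list of required fields is an assumption; only `name` and `vers` come from the description.

```python
import json

# Assumed subset of fields a full parse would require.
REQUIRED_FIELDS = ("name", "vers", "deps", "cksum")

def parse_strict(line):
    """Full parse: fail if any field the resolver needs is missing."""
    entry = json.loads(line)
    for field in REQUIRED_FIELDS:
        if field not in entry:
            raise ValueError("missing field: " + field)
    return entry

def parse_index_line(line):
    """Return (entry, is_valid); (None, False) if even the lax parse fails."""
    try:
        return parse_strict(line), True
    except (ValueError, TypeError):
        pass
    try:
        entry = json.loads(line)
        # Lax parse: keep only what is needed to report the bad entry.
        return {"name": str(entry["name"]), "vers": str(entry["vers"])}, False
    except (ValueError, KeyError, TypeError):
        return None, False  # nothing reportable, skip the line

good = '{"name": "a", "vers": "1.0.0", "deps": [], "cksum": "00"}'
partial = '{"name": "a", "vers": "2.0.0", "v": 99}'
```

An entry that only survives the lax parse can still be named in an error message, while a line without `name` and `vers` has to be skipped silently.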

rust-highfive assigned Eh2406 Apr 26, 2019

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 26, 2019

alexcrichton force-pushed the cache branch from 8b1a902 to 720d4d1 Compare April 26, 2019 15:49

alexcrichton force-pushed the cache branch from 68dcc83 to 25ee430 Compare April 26, 2019 18:52

Eh2406 reviewed Apr 30, 2019

alexcrichton force-pushed the cache branch from 25ee430 to 928c867 Compare April 30, 2019 21:28

Eh2406 reviewed May 1, 2019

alexcrichton force-pushed the cache branch 2 times, most recently from 1258890 to b8ca83a Compare May 1, 2019 14:32

alexcrichton added 3 commits May 3, 2019 07:23

Don't allocate in SourceId::is_default_registry

c7e1b68

This gets called quite a lot and doesn't need to allocate in the first place!

bors mentioned this pull request May 3, 2019

Import the cargo-vendor subcommand into Cargo #6869

Merged

alexcrichton deleted the cache branch May 6, 2019 15:40

alexcrichton mentioned this pull request May 6, 2019

Optimizes Cargo's registry cache format for fewer files #6908

Open

ehuss mentioned this pull request May 7, 2019

Update cargo rust-lang/rust#60596

Merged

glasserc mentioned this pull request May 9, 2019

Failing to build packages with Cargo nightly NixOS/nixpkgs#61192

Closed

This was referenced May 10, 2019

cargo check --frozen requires the cargo home to be writable on nightly #6928

Closed

cargo fetch can't be parallelized on nightly anymore #6930

Closed

alexcrichton mentioned this pull request May 14, 2019

Re-enable compatibility with readonly CARGO_HOME #6940

Merged

This was referenced May 24, 2019

Make resolution with correct lock file faster #5321

Closed

Crate graph resolution on parity-wasm with a lockfile is noticeably slow #5817

Closed

alexcrichton mentioned this pull request Jul 30, 2019

Fix excluding target dirs from backups on OSX #7192

Merged

This was referenced Sep 3, 2019

don't need to copy this string #7324

Merged

map_dependencies is doing a deep clone, so lets make it cheaper #7326

Merged

ehuss mentioned this pull request Mar 25, 2020

Remove the git-checkout subcommand. #8040

Merged

ehuss added this to the 1.36.0 milestone Feb 6, 2022

weihanglo added a commit to weihanglo/cargo that referenced this pull request Jun 9, 2023

refactor: remove leftover of rust-lang#6880

768708f

The relevant part was removed in 1daff03 LAST_UPDATED_FILE was never used even before rust-lang#6880. They were just leftover during the PR updates.

epage mentioned this pull request Dec 12, 2024

fix(resolver): Report invalid index entries #14927

Merged
