Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse less JSON on null builds #6880

Merged
merged 7 commits into from
May 3, 2019
Merged

Parse less JSON on null builds #6880

merged 7 commits into from
May 3, 2019

Conversation

alexcrichton
Copy link
Member

This commit fixes a performance pathology in Cargo today. Whenever Cargo
generates a lock file (which happens on all invocations of cargo build
for example) Cargo will parse the crates.io index to learn about
dependencies. Currently, however, when it parses a crate it parses the
JSON blob for every single version of the crate. With a lock file,
however, or with incremental builds only one of these lines of JSON is
relevant. Measured today Cargo building Cargo parses 3700 JSON
dependencies in the registry.

This commit implements an optimization that brings down the number of
parsed JSON lines in the registry to precisely the right number
necessary to build a project. For example Cargo has 150 crates in its
lock file, so now it only parses 150 JSON lines (a 20x reduction from
3700). This in turn can greatly improve Cargo's null build time. Cargo
building Cargo dropped from 120ms to 60ms on a Linux machine and 400ms
to 200ms on a Mac.

The commit internally has a lot more details about how this is done but
the general idea is to have a cache which is optimized for Cargo to read
which is maintained automatically by Cargo.

Closes #6866

@rust-highfive
Copy link

r? @Eh2406

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 26, 2019
@alexcrichton
Copy link
Member Author

This isn't 100% ready to go yet since it still doesn't handle concurrent writes into the global cache, but I wanted to put this up for initial thoughts if there were any! I hope to fix the global cache write synchronization tomorrow

@Eh2406
Copy link
Contributor

Eh2406 commented Apr 26, 2019

This is realy cool! I will have to revue when I am more awake. Overall I wonder how can we test this well. How can we be absolutely sure the there are no race conditions, or other bugs, leading to things getting out of sink? Now and in the future? Similarly for making sure that the files work across cargo versions.

@alexcrichton
Copy link
Member Author

Heh that's a good question! I don't think we can really be sure that race conditions/bugs are gone related to that ever. It's largely I think about how we architect locking and evaluate it if it feels as foolproof as possible. I had an idea this morning I'm going to toy which that I'm pretty confident in, but really we can only get but so far here.

In terms of working with Cargo against future versions it's sort of the same, I'm trying to be very liberal with ignoring errors and proactive with some degree of versioning, but in reality there's really only so much we can do against this I think.

@alexcrichton
Copy link
Member Author

Ok I've updated with a strategy to lock the index and ensure that concurrent updates work ok, even with this new caching strategy. The new locking strategy is to basically just not have granular locks and instead have one large global lock protecting all of resolve, for example. This is done to avoid us having to worry about all these concurrent updates, and it in theory isn't any loss in functionality either.

@alexcrichton
Copy link
Member Author

@ehuss you might be intersted in the locking commit as well

@Eh2406
Copy link
Contributor

Eh2406 commented Apr 26, 2019

I did not grock this yet, but some thoughts:

  • Can we make CURRENT_CACHE_VERSION part of the file path so we reduce the chance of cross talk? (or has this already been done.) Actually, where are the cache file stored?
  • Should we have a debug assertion that the cached version matches the canonical one?
  • Can we have test that the Cargo that was used to build the tests is compatible with the Cargo being tested? (This would be good for lock files as well.) Set up a registry, have the host Cargo make its cache files, have the test Cargo run to see if it correctly uses or ignores the files, have the host Cargo run to make sure we don't brake.
  • Can we have test that the host Cargo and the test Cargo, can run concurrently without messing up each others lock? BTW what happens if new Cargo gets a new course lock while an old Cargo has a granular lock?
  • Given that the registry is a Git project can we use the Hash of head instead of mtime? (mtime, often does weird things)

@alexcrichton
Copy link
Member Author

I'm personally not really super concerned about cross-cargo-version issues here. I think we need to at a bare minimum ensure it works (nothing gets corrupted across Cargo invocations), but other than that I feel like it's a bit much for us to maintain anything else. For example I don't think we need to optimize for the use case where you oscillate between Cargo versions and it might thrash the cache that we're building here. The purpose of this PR is to reduce the overhead of Cargo as much as possible on incremental builds, and part of the incremental aspect is not changing Cargo that much!

In that sense I could make the version part of the path for sure and we could reduce cache thrashing, but I don't think it's too too important here. Additionally while it's happened to work in the past I don't think we should strive to say that concurrent invocations of different versions of Cargo are supposed to work (rather only concurrent invocations of the same version are expected to work).

I like the idea of a debug assertion and using the git hash instead of the mtime, I'll look to implement those later when this is closer to being god to go!

For testing, I'm not sure how we'd manage that unfortunately. We can't really rely on the host Cargo to be any particular version so the tests would already have to be really loose. It may be best to just unit-test the code in question and make sure that error handling is as conservative as possible, since I'm not sure how to best test these things (but I think it's pretty minor)

@bors
Copy link
Contributor

bors commented Apr 29, 2019

☔ The latest upstream changes (presumably #6871) made this pull request unmergeable. Please resolve the merge conflicts.

@@ -14,6 +14,7 @@ pub use self::shell::{Shell, Verbosity};
pub use self::source::{GitReference, Source, SourceId, SourceMap};
pub use self::summary::{FeatureMap, FeatureValue, Summary};
pub use self::workspace::{Members, Workspace, WorkspaceConfig, WorkspaceRootConfig};
pub use self::interning::InternedString;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of Cow<'_, str> that can be replaced with InternedString now that it is publick.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that was one thing I was going to try if JSON parsing still showed up in the profile, but after this PR the JSON parsing disappeared so I think it's less pressing to do that just yet (can be a follow-up of course!)

@Eh2406
Copy link
Contributor

Eh2406 commented Apr 30, 2019

I did a more indepth review it looks good, and I really like it!
Three fundamental questions:

    1. Why do the processing on the client side, why not have the index just write the better format in the first place? (Presumably this is a smaller change, givin now that custom registries are stable.)
    1. If we are building a custom format why stay with JSON? (Presumably we can always experiment with changing this after this lands, so start with the smallest change.
    1. Lots of small files, how does this perform on windows... (Presumably you want me to answer that)

@alexcrichton
Copy link
Member Author

Good questions!


Why do the processing on the client side, why not have the index just write the better format in the first place? (Presumably this is a smaller change, givin now that custom registries are stable.)

For doing this on the client side rather than the index itself, I think the main reason is feasibility. I've long figured that the index will not satisfy Cargo until the end of time and we'd need an even faster indexing format at some point. Having the capability of a local cache managed by Cargo allows us to divorce these two aspects. The index is primarily focused on optimizing for delta updates (making it super easy and fast to incrementally update it), whereas Cargo's problem is different where it's effectively requiring random access to the index. I think that if this is the only change we make to Cargo's internal format for a long time to come, but otherwise I think it's inevitable that Cargo's own format for the index on disk is divorced from the index's upstream format.

Frankly though for the index format it's just way easier to do it in Cargo. The amount of effort needed to change the index itself and make sure everything doesn't break means that this probably wouldn't get done.


If we are building a custom format why stay with JSON? (Presumably we can always experiment with changing this after this lands, so start with the smallest change.

Heh another good question! I wasn't sure whether JSON would still be slow, so this is where I let the profiles guide me. It was obvious from before that parsing thousands of lines of JSON took hundreds of milliseconds and the easiest win was to simply not parse thousands of lines but only the handful needed. Since then I haven't seen JSON parsing in the profile.

We could of course, however, change the format of the cache files at any time. That's the point of the local cache for Cargo :). If even JSON is too slow we'll have to carefully design a new format to ensure it preserves all the relevant information, but it's certainly possible to do so.


Lots of small files, how does this perform on windows...

Ah yeah unfortunately I don't have access to my Windows machine right now to test this out. Previously we were reading one big git file but now we're reading a lot of little files around the filesystem, so I'm honestly not entirely sure what the performance is. It's a good point though and something we should measure before landing. Would you be up for helping me out with measurements?

@Eh2406
Copy link
Contributor

Eh2406 commented Apr 30, 2019

I ran a number of commands with two versions of cargo. One is from the head of this PR (25ee430d41cb48be76d31747847e007db614b234) the other from after the small optimizations (319e9bb09bdb8003310ed756ac627235bf46af1a)
both rebase on master (af1fcb3). Both build locally with --release from rustc 1.36.0-nightly (6d599337f 2019-04-22). The script ran each combination once outside the timing loop then calculated the wall time to run the command 15 times in the Cargo project.

for command in [["update"], "update -p hex".split(), ["generate-lockfile"], ["build"]]:
    for cargo in ["PR", "Master"]:
        p1 = subprocess.Popen(['speed/cargo-' + cargo, command, "-Zno-index-update"],
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        p1.communicate()
        start = time.clock()
        n = 15
        for i in range(n):
            p1 = subprocess.Popen(['speed/cargo-' + cargo] + command + ["-Zno-index-update"],
                                  stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            p1.communicate()
        print command, cargo, (time.clock() - start) / n

The table below shows the average wall time in seconds for each combination.

Edit: looks like you need to update the index to get the new speeds. So here are the numbers with the same artifacts and commands after I once ran a command without "-Zno-index-update",

Edit: the command for update -p hex was malformed.

command Master PR %change
update 0.27955078 0.195570633333 30%
update -p hex 0.266721386667 0.151231233333 43%
generate-lockfile 0.266229426667 0.171039146667 35%
build 0.714220346667 0.711805573333 0.3%

@alexcrichton
Copy link
Member Author

Ok I've rebased and pushed up a commit which uses git sha information instead of mtime information which should be more robust, as well as an additional commit which adds a debug assertion that if we think the cache is fresh it actually is.

@Eh2406
Copy link
Contributor

Eh2406 commented May 1, 2019

New timings are:

command Master PR %change
update 0.26499556 0.183760626667 30.6%
update -p hex 0.247472593333 0.13809348 44.2%
generate-lockfile 0.249164073333 0.16347192 34.4%
build 0.6984459 0.69787098 0.1%

@alexcrichton alexcrichton force-pushed the cache branch 2 times, most recently from 1258890 to b8ca83a Compare May 1, 2019 14:32
@alexcrichton
Copy link
Member Author

Updated!

@Eh2406
Copy link
Contributor

Eh2406 commented May 1, 2019

Ok this looks good! Other improvements can always be done in follow up PRs.

I keep getting distracted by changes that may be small improvements, when according to my profiling the only thing that matters is a design that uses fewer files.

@Eh2406
Copy link
Contributor

Eh2406 commented May 1, 2019

So here is a fleshed out version of my 3 file straw man.

  • index.json is a uncompressed concatenated version of all of the index files we have actually read. In the existing format.
  • versions.lookup is an fast way to look up where in index.json the row is for each version. It would have the format version-string\0start\0end\n.
  • names.lookup is an fast way to look up where in versions.lookup are the rows for each name. It would have the format name\0start\0end\n.

to do a search:

  1. read in and parce all of names.lookup. If the name we want is not in the then read it from the raw index and append to the files.
  2. read in all of versions.lookup, but only parce the part of the file that the names.lookup told us to.
  3. read in all of index.json, but only parce the part of the file that the versions.lookup told us to.

when we pull a new version of the index dell the files.

"read in all of" can probably be mmap, if that is a bottleneck.

Also we should look if there are existing libraries for on disk indexing into a file... https://crates.io/crates/csv-index mabey? actually that does something very similar.

@bors
Copy link
Contributor

bors commented May 1, 2019

☔ The latest upstream changes (presumably #6896) made this pull request unmergeable. Please resolve the merge conflicts.

This was currently getting executed on all builds, even if the directory
already exists. There shouldn't be any reason though to exclude the
directory from backups on all builds, and after seeing this get a stack
sample in a profile I figured it's best to ensure it only executes once
in case the backing system implementation isn't the speediest.
This gets called quite a lot and doesn't need to allocate in the first
place!
This commit fixes a performance pathology in Cargo today. Whenever Cargo
generates a lock file (which happens on all invocations of `cargo build`
for example) Cargo will parse the crates.io index to learn about
dependencies. Currently, however, when it parses a crate it parses the
JSON blob for every single version of the crate. With a lock file,
however, or with incremental builds only one of these lines of JSON is
relevant. Measured today Cargo building Cargo parses 3700 JSON
dependencies in the registry.

This commit implements an optimization that brings down the number of
parsed JSON lines in the registry to precisely the right number
necessary to build a project. For example Cargo has 150 crates in its
lock file, so now it only parses 150 JSON lines (a 20x reduction from
3700). This in turn can greatly improve Cargo's null build time. Cargo
building Cargo dropped from 120ms to 60ms on a Linux machine and 400ms
to 200ms on a Mac.

The commit internally has a lot more details about how this is done but
the general idea is to have a cache which is optimized for Cargo to read
which is maintained automatically by Cargo.

Closes rust-lang#6866
@alexcrichton
Copy link
Member Author

I've opened #6908 to track your suggestion @Eh2406

bors added a commit to rust-lang/rust that referenced this pull request May 7, 2019
Update cargo

12 commits in beb8fcb5248dc2e6aa488af9613216d5ccb31c6a..759b6161a328db1d4863139e90875308ecd25a75
2019-04-30 23:58:00 +0000 to 2019-05-06 20:47:49 +0000
- Small things (rust-lang/cargo#6910)
- Fix skipping over invalid registry packages (rust-lang/cargo#6912)
- Fixes rust-lang/cargo#6874 (rust-lang/cargo#6905)
- doc: Format examples of version to ease reading (rust-lang/cargo#6907)
- fix more typos (codespell) (rust-lang/cargo#6903)
- Parse less JSON on null builds (rust-lang/cargo#6880)
- chore: Update opener to 0.4 (rust-lang/cargo#6902)
- Update documentation for auto-discovery. (rust-lang/cargo#6898)
- Update some doc links. (rust-lang/cargo#6897)
- Default Cargo.toml template provide help for completing the metadata (rust-lang/cargo#6881)
- Run 'cargo fmt --all' (rust-lang/cargo#6896)
- Refactor command definition (rust-lang/cargo#6894)
alexcrichton added a commit to alexcrichton/cargo that referenced this pull request May 14, 2019
Previously Cargo would attempt to work as much as possible with a
previously filled out CARGO_HOME, even if it was mounted as read-only.
In rust-lang#6880 this was regressed as a few global locks and files were always
attempted to be opened in writable mode.

This commit fixes these issues by correcting two locations:

* First the global package cache lock has error handling to allow
  acquiring the lock in read-only mode inaddition to read/write mode. If
  the read/write mode failed due to an error that looks like a readonly
  filesystem then we assume everything in the package cache is readonly
  and we switch to just acquiring any lock, this time a shared readonly
  one. We in theory aren't actually doing any synchronization at that
  point since it's all readonly anyway.

* Next when unpacking package we're careful to issue a `stat` call
  before opening a file in writable mode. This way our preexisting guard
  to return early if a package is unpacked will succeed before we open
  anything in writable mode.

Closes rust-lang#6928
bors added a commit that referenced this pull request May 14, 2019
Re-enable compatibility with readonly CARGO_HOME

Previously Cargo would attempt to work as much as possible with a
previously filled out CARGO_HOME, even if it was mounted as read-only.
In #6880 this was regressed as a few global locks and files were always
attempted to be opened in writable mode.

This commit fixes these issues by correcting two locations:

* First the global package cache lock has error handling to allow
  acquiring the lock in read-only mode inaddition to read/write mode. If
  the read/write mode failed due to an error that looks like a readonly
  filesystem then we assume everything in the package cache is readonly
  and we switch to just acquiring any lock, this time a shared readonly
  one. We in theory aren't actually doing any synchronization at that
  point since it's all readonly anyway.

* Next when unpacking package we're careful to issue a `stat` call
  before opening a file in writable mode. This way our preexisting guard
  to return early if a package is unpacked will succeed before we open
  anything in writable mode.

Closes #6928
alexcrichton added a commit to alexcrichton/cargo that referenced this pull request Jul 30, 2019
This fixes an accidental regression from rust-lang#6880 identified in rust-lang#7189 by
moving where the configuration of backup preferences happens since it
was accidentally never happening due to the folder always having been
created.
bors added a commit that referenced this pull request Jul 31, 2019
Fix excluding target dirs from backups on OSX

This fixes an accidental regression from #6880 identified in #7189 by
moving where the configuration of backup preferences happens since it
was accidentally never happening due to the folder always having been
created.

Closes #7189
bors added a commit that referenced this pull request Sep 4, 2019
don't need to copy this string

This removes a `String::clone` that I noticed when profiling no-op builds of cargo, benchmarks show a barely visible improvement. Looks like it was added in #6880, but I am not sure why.
bors added a commit that referenced this pull request Sep 4, 2019
`map_dependencies` is doing a deep clone, so lets make it cheaper

This removes a `FeatureMap::clone` that I noticed when profiling no-op builds of cargo, benchmarks show a ~5% improvement. Looks like #6880 means that there is a ref to every `Summery` so the `Rc::make_mut` dose a deep clone.
bors added a commit that referenced this pull request Mar 25, 2020
Remove the `git-checkout` subcommand.

This command has been broken for almost a year (since #6880), and nobody has mentioned it. The command isn't very useful (it checks out into cargo's `db` directory, which can also be accomplished with `cargo fetch`).  Since it doesn't have much utility, I don't see much reason to keep it around.
@ehuss ehuss added this to the 1.36.0 milestone Feb 6, 2022
weihanglo added a commit to weihanglo/cargo that referenced this pull request Jun 9, 2023
The relevant part was removed in 1daff03
LAST_UPDATED_FILE was never used even before rust-lang#6880.
They were just leftover during the PR updates.
epage added a commit to epage/cargo that referenced this pull request Dec 12, 2024
While rust-lang#14897 reported packages with an unsupported index schema version,
that only worked if the changes in the schema version did not cause
errors in deserializing `IndexPackage` or in generating a `Summary`.

This extends that change by recoverying on error with a more lax,
incomplete parse of `IndexPackage` which should always generate a valid
`Summary`.

To help with a buggy Index, we also will report as many as we can.
This does not provide a way to report to users or log on cache reads if
the index entry is not at least `{"name": "<string>", "vers": "<semver>"}`.

As a side effect, the index cache will include more "invalid" index
entries.
That should be ok as we will ignore the invalid entry in the cache when
loading it.
Ignoring of invalid entries dates back to rust-lang#6880 when the index cache was
introduced.

Fixes rust-lang#10623
Fixes rust-lang#14894
epage added a commit to epage/cargo that referenced this pull request Dec 12, 2024
While rust-lang#14897 reported packages with an unsupported index schema version,
that only worked if the changes in the schema version did not cause
errors in deserializing `IndexPackage` or in generating a `Summary`.

This extends that change by recoverying on error with a more lax,
incomplete parse of `IndexPackage` which should always generate a valid
`Summary`.

To help with a buggy Index, we also will report as many as we can.
This does not provide a way to report to users or log on cache reads if
the index entry is not at least `{"name": "<string>", "vers": "<semver>"}`.

As a side effect, the index cache will include more "invalid" index
entries.
That should be ok as we will ignore the invalid entry in the cache when
loading it.
Ignoring of invalid entries dates back to rust-lang#6880 when the index cache was
introduced.

Fixes rust-lang#10623
Fixes rust-lang#14894
github-merge-queue bot pushed a commit that referenced this pull request Dec 13, 2024
### What does this PR try to resolve?

While #14897 reported packages with an unsupported index schema version,
that only worked if the changes in the schema version did not cause
errors in deserializing `IndexPackage` or in generating a `Summary`.

This extends that change by recoverying on error with a more lax,
incomplete parse of `IndexPackage` which should always generate a valid
`Summary`.

To help with a buggy Index, we also will report as many as we can.
This does not provide a way to report to users or log on cache reads if
the index entry is not at least `{"name": "<string>", "vers":
"<semver>"}`.

Fixes #10623
Fixes #14894

### How should we test and review this PR?

My biggest paranoia is some bad interaction with the index cache
including more "invalid" index entries.
That should be ok as we will ignore the invalid entry in the cache when
loading it.
Ignoring of invalid entries dates back to #6880 when the index cache was
introduced.

### Additional information
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Cargo's load time of crates.io index information
5 participants