CI Installation performance #6376
-
@dmichon-msft I'd like to help push this idea along, but we need a better name for the feature. "Phased install" isn't very accurate since your prototype ended up not actually implementing per-project phases (given that significant speed gains were already achieved by the other optimizations). A key aspect involves distributing the work among a pool of worker threads, so... What if we call your algorithm a threaded install for PNPM?
-
@dmichon-msft sorry for not seeing this discussion earlier. This is amazing! This is actually partly what @mcollina suggested doing recently. Could you make a PR with this improvement? Even if it is not ready yet, we can work together to polish it.
-
@nachoaldamav this might be interesting to you, as you were also experimenting with sync fs operations in ultra.
-
@zkochan
-
I am not sure if this has any relation, but I've noticed in my own project with ~1800 dependencies that on my CI machines, even if the pnpm store and node_modules are fully cached and PNPM reports that it's downloading nothing, it still takes 15-25 seconds for `pnpm install` to complete.
Not sure if there's something I can do to make it go faster, given that this is delaying almost every step in our pipeline and adding many minutes to CI runs overall - it seems that locally it would just report that everything is already up to date.
-
Summary
I've recently been doing work on dependency installation performance, since installations on my team's Linux CI agents are rather slower than I'd like (our repository serves about 300 developers at Microsoft). Following some guidance from one of my colleagues who has been doing installation performance work for an internal fork of Yarn, I was able to put together a CI-mode installer that reads the pnpm-lock.yaml and installs all the packages in the same layout (as seen by user code) in about 8% of the time.
Old Install (`pnpm install --frozen-lockfile`)

- pnpm version: 7.26.1
- total packages: 7032 downloaded, 7406 installed, 8297 linked (including workspace packages)
- mode: no cache, yes lockfile, no node_modules
- agent: D32ads_v5 Azure SKU
- os: Linux
- duration: 5 minutes, 14.5 seconds
New Install (`rush phased-install`) (local custom algorithm)

- total packages: 7032 downloaded, 7406 installed, 8297 linked (including workspace packages)
- mode: no cache, yes lockfile, no node_modules
- agent: D32ads_v5 Azure SKU
- os: Linux
- duration: 23.6 seconds
How the custom installer works

1. Reads `pnpm-lock.yaml` and constructs a task graph.
2. Downloads tarballs over `node:http2` (the `connect(url, options)` API) or the `node:https` module. Configure the `https.Agent` to ensure that the timeout is at least 60 seconds.
3. Collects each tarball into an `ArrayBuffer` and `postMessage`s it (with transfer) to a worker thread for parsing. Allow up to `#CPUs * 0.9` (configurable) concurrent worker threads. This ensures that the main thread is not doing any significant CPU work.
4. In the worker, gunzips into a `SharedArrayBuffer`, then synchronously parses the raw TAR buffer, generating a Map from filename to `{ mode, offset, length }`.
5. Posts the `SharedArrayBuffer` and Map back to the orchestrator via `postMessage`, so that the remaining work can be load balanced again, and so that the orchestrator can get access to the information in `package.json`.
6. Uses synchronous `mkdirSync`/`openSync`/`writeSync`/`closeSync` calls to produce the output files (a sketch of the worker side follows this list).
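For concreteness, here is a minimal sketch (not the prototype's actual code) of what the worker side of steps 4-6 could look like. The `UnpackRequest` message shape and `targetDir` field are invented for illustration, the sketch writes files in the same worker rather than posting the index back for rebalancing as described above, and integrity checking and error handling are omitted.

```ts
// unpack-worker.ts - minimal sketch of the worker side (steps 4-6); not the prototype's code.
import { parentPort } from 'node:worker_threads';
import { gunzipSync } from 'node:zlib';
import { mkdirSync, openSync, writeSync, closeSync } from 'node:fs';
import * as path from 'node:path';

// Hypothetical message shape: the orchestrator transfers the raw .tgz bytes plus a target folder.
interface UnpackRequest {
  tarGz: ArrayBuffer;
  targetDir: string;
}

interface TarEntry {
  mode: number;
  offset: number;
  length: number;
}

parentPort!.on('message', ({ tarGz, targetDir }: UnpackRequest) => {
  // Gunzip synchronously, then copy into a SharedArrayBuffer so the raw TAR bytes
  // can later be shared with the orchestrator without another copy.
  const decompressed = gunzipSync(Buffer.from(tarGz));
  const shared = new SharedArrayBuffer(decompressed.length);
  const tar = Buffer.from(shared);
  decompressed.copy(tar);

  // Walk the 512-byte TAR headers, building filename -> { mode, offset, length }.
  const entries = new Map<string, TarEntry>();
  let offset = 0;
  while (offset + 512 <= tar.length) {
    const name = tar.toString('utf8', offset, offset + 100).replace(/\0.*$/, '');
    if (!name) break; // two all-zero blocks terminate the archive
    const mode = parseInt(tar.toString('utf8', offset + 100, offset + 108), 8) || 0o644;
    const size = parseInt(tar.toString('utf8', offset + 124, offset + 136), 8) || 0;
    const typeflag = tar.toString('utf8', offset + 156, offset + 157);
    if (typeflag === '0' || typeflag === '\0') {
      entries.set(name, { mode, offset: offset + 512, length: size });
    }
    offset += 512 + Math.ceil(size / 512) * 512; // file data is padded to 512-byte blocks
  }

  // Produce the output files with plain synchronous fs calls; this thread has nothing
  // better to do, and the async machinery would only add overhead.
  for (const [name, entry] of entries) {
    // npm tarballs conventionally prefix entries with "package/".
    const outPath = path.join(targetDir, name.replace(/^package\//, ''));
    mkdirSync(path.dirname(outPath), { recursive: true });
    const fd = openSync(outPath, 'w', entry.mode);
    writeSync(fd, tar, entry.offset, entry.length);
    closeSync(fd);
  }

  // Hand the buffer and index back so the orchestrator can read package.json from it.
  parentPort!.postMessage({ shared, entries });
});
```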
While the download is ongoing, the installer can be busy creating all the necessary `node_modules` symlink layouts, since that doesn't depend on the downloads.

Performance notes
On Windows, unpack time increases by about 10x relative to Linux. I haven't found any ways around this; the parse/unpack routine is already about 2x as fast as the native `tar.exe`.

The biggest performance gain here comes from using synchronous I/O in worker threads, since Node's async fs APIs are ultimately just synchronous I/O performed on a pool of threads, but by controlling the threads directly a lot of that overhead can be avoided.
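As a rough illustration of "controlling the threads directly", the following sketch keeps a fixed pool of roughly `#CPUs * 0.9` workers fed from a queue and transfers each tarball's `ArrayBuffer` to the worker so no copy is made. The file name `unpack-worker.js`, the `Job` shape, and `enqueueUnpack` are assumptions for the example, not the prototype's API.

```ts
// orchestrator.ts - minimal sketch of a fixed worker pool driven directly from the main thread.
import { Worker } from 'node:worker_threads';
import * as os from 'node:os';

interface Job {
  tarGz: ArrayBuffer;
  targetDir: string;
}

// Leave a little headroom for the main thread and the download sockets.
const poolSize = Math.max(1, Math.floor(os.cpus().length * 0.9));
const idle: Worker[] = [];
const queue: Job[] = [];

function dispatch(worker: Worker, job: Job): void {
  // Transfer the ArrayBuffer rather than copying it; the main thread no longer needs it.
  worker.postMessage(job, [job.tarGz]);
}

for (let i = 0; i < poolSize; i++) {
  const worker = new Worker(new URL('./unpack-worker.js', import.meta.url));
  // Each result message means the worker is free again, so feed it the next queued tarball.
  worker.on('message', () => {
    const next = queue.shift();
    if (next) {
      dispatch(worker, next);
    } else {
      idle.push(worker);
    }
  });
  idle.push(worker);
}

// Called by the download layer whenever a tarball has been fully received.
export function enqueueUnpack(job: Job): void {
  const worker = idle.pop();
  if (worker) {
    dispatch(worker, job);
  } else {
    queue.push(job);
  }
}
```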
The next biggest performance gain is the use of multiple HTTP/2 connections for the registry communication. Using multiple connections separates TCP congestion control and allows the registry's load balancer to distribute them across multiple servers, while still taking advantage of HTTP/2's ability to send a large number of streams over the same connection.

Edit: It appears that using HTTP/1.1 doesn't make a significant performance difference vs. HTTP/2, unless the number of concurrent streams is much higher than the number of open sockets.
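A hedged sketch of the multiple-connection approach: open a small, fixed number of `node:http2` sessions to the registry and spread tarball requests across them round-robin, so each connection gets its own TCP congestion window while still multiplexing many streams. The registry URL, session count, and `downloadTarball` are placeholders; authentication (the Authorization header mentioned below) and retries are omitted.

```ts
// download.ts - minimal sketch of spreading tarball requests over a few HTTP/2 sessions.
import { connect, constants, type ClientHttp2Session } from 'node:http2';

const REGISTRY = 'https://registry.npmjs.org'; // placeholder; the prototype's registry is configurable
const SESSION_COUNT = 4;                       // a handful of connections, not one per request

const sessions: ClientHttp2Session[] = Array.from({ length: SESSION_COUNT }, () => connect(REGISTRY));
let next = 0;

export function downloadTarball(tarballPath: string): Promise<Buffer> {
  // Round-robin across sessions: each connection gets its own TCP congestion window and
  // can land on a different server behind the registry's load balancer, while every
  // session still multiplexes many concurrent streams.
  const session = sessions[next++ % sessions.length];

  return new Promise((resolve, reject) => {
    const req = session.request({ [constants.HTTP2_HEADER_PATH]: tarballPath });
    const chunks: Buffer[] = [];
    req.on('data', (chunk: Buffer) => chunks.push(chunk));
    req.on('end', () => resolve(Buffer.concat(chunks)));
    req.on('error', reject);
    req.end();
  });
}
```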
It's entirely possible that performance could be improved a fair bit further, since I still used Node's async I/O for symlink creation instead of handing that off to the same pool of workers. Additionally, it might be beneficial to perform the downloads directly in the worker thread that will also perform the parsing, so that integrity checking and gunzip can be streamed without blocking the main thread. This would also avoid the need to copy the tarball to a single buffer before decompression.
Parse cost is dominated by gunzip.
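To illustrate the streaming idea above, here is a minimal sketch that hashes the compressed bytes and gunzips them as they arrive, instead of first copying the whole tarball into one buffer. It assumes the SRI-style `sha512-<base64>` integrity strings that pnpm-lock.yaml records; `checkAndGunzip` and its signature are invented for the example.

```ts
// streaming-check.ts - minimal sketch of overlapping integrity checking and gunzip with the download.
import { createHash } from 'node:crypto';
import { createGunzip } from 'node:zlib';
import type { Readable } from 'node:stream';

// `integrity` is the lockfile's SRI string, e.g. "sha512-<base64 digest>".
export async function checkAndGunzip(tarballStream: Readable, integrity: string): Promise<Buffer> {
  const [algorithm, expected] = integrity.split('-', 2);
  const hash = createHash(algorithm);
  const gunzip = createGunzip();
  const chunks: Buffer[] = [];

  // Hash the compressed bytes as they arrive; the same bytes flow straight into gunzip.
  tarballStream.on('data', (chunk: Buffer) => hash.update(chunk));

  // Decompress incrementally instead of buffering the whole tarball first.
  for await (const chunk of tarballStream.pipe(gunzip)) {
    chunks.push(chunk as Buffer);
  }

  if (hash.digest('base64') !== expected) {
    throw new Error(`integrity mismatch (${algorithm})`);
  }
  return Buffer.concat(chunks); // raw TAR bytes, ready for the synchronous parser
}
```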
I can share detailed code from the prototype; there's nothing proprietary beyond the fact that I've currently hardcoded the registry domain and the process of obtaining the Authorization header, and that's easily abstracted.