Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Globbing the whole workspace could be faster #1182

Closed
Blugatroff opened this issue Jan 26, 2024 · 10 comments
Closed

Globbing the whole workspace could be faster #1182

Blugatroff opened this issue Jan 26, 2024 · 10 comments

Comments

@Blugatroff
Copy link
Contributor

spago build takes surprisingly long to read it's configuration files.

When using my somewhat older laptop i noticed, that an invocation of spago build on my smallish AdventOfCode project took 3.5 seconds, even though everything had already been compiled.
I tried investigating what took so long by inserting logDebug and using the --verbose flag.

I noticed that running purs isn't actually the single most expensive step (took 800ms), instead globbing the workspace for every spago.yaml and parsing spago.lock takes quite long.
I usually expect a package manager to finish very quickly when there is no actual work to be done, even on slightly older hardware.
The old spago just takes 1.5s on the same project and device, so is this just the cost of writing spago in a language that compiles to javascript, or can we reduce this startup time? Or perhaps some of these steps can be skipped in certain cases?

How I measured:
time node ~/clones/spago/bin/bundle.js build --verbose shows:

[      25ms] CWD: /home/jasper/projects/pure-advent
[      50ms] Global cache: "/home/jasper/.cache/spago-nodejs"
[      52ms] Local cache: "/home/jasper/projects/pure-advent/.spago"
[     112ms] DB: Connecting to database at /home/jasper/.cache/spago-nodejs/spago.v1.sqlite
[     122ms] Reading Spago workspace configuration...
[     122ms] Reading config from spago.yaml
[     164ms] Before Glob.match **/spago.yaml                    I Added this logDebug
[     698ms] After Glob.match **/spago.yaml                       Took about 550ms
[     701ms] Selecting package aoc from ./
[     702ms] Before FS.readYamlFile                             I Added this logDebug
[    1540ms] After FS.readYamlFile                               Took about 800ms
[    1541ms] Lockfile is up to date, using it
[    1542ms] Found lockfile, using the package set from there
[    1548ms]
✅ Selecting package to build: aoc

[    1549ms] Package path: ./
[    1550ms] Requested to install these packages: []
[    1556ms] Downloading dependencies...
[    1619ms] Building...
[    1655ms] Running command:
purs compile --codegen corefn,docs,js,sourcemaps --json-errors .spago/p/aff-promise/3aa74e68e . . . . . . . . .
           Src   Lib   All
Warnings     0     0     0
Errors       0     0     0
[    2464ms]
✅ Build succeeded.

[    2464ms] Looking for unused and undeclared transitive dependencies...
[    2465ms] Running command:
purs graph .spago/p/aff-promise/3aa74e68e3e4c3e . . . . . . . . . . .
[    2659ms] Called `purs graph`, decoding..

real	0m3,561s
user	0m7,615s
sys	0m1,242s
@f-f
Copy link
Member

f-f commented Jan 26, 2024

Thanks for the detailed report @Blugatroff! We had a similar report in #1140, but that's unfortunately very light on details which made it hard to investigate.

Indeed 3.5s is a long time and I'd like it to be shorter, and think that it's possible to shorten it - according to your logs the globbing is 500ms, and reading the lockfile is 800. The first one is not too surprising but the latter is!
Do you have this repo hosted somewhere so I could take a look?

@wclr
Copy link
Contributor

wclr commented Jan 26, 2024

I believe a recursive walk using just FS.readDir with early exclusion of gitignore locations should do good enough work.

@Blugatroff
Copy link
Contributor Author

Do you have this repo hosted somewhere so I could take a look?

Yes @f-f , https://github.com/Blugatroff/pure-advent/

Could the globbing specifically also take long because there are a lots of deeply nested directories in my project directory which get searched unnecessarily? I'm talking about stuff like .psci_modules, generated-docs, output, output-es, .spago

@JordanMartinez
Copy link
Contributor

Could the globbing specifically also take long because there are a lots of deeply nested directories in my project directory which get searched unnecessarily?

That was the general conclusion I reached in #1140, but there wasn't a reproducible repo to confirm that.

@bakhtiyarneyman
Copy link

bakhtiyarneyman commented Feb 16, 2024

Maybe we can speed it up with something like Cargo.toml's members. (A benchmark should come before a solution, of course).

@f-f
Copy link
Member

f-f commented Mar 9, 2024

@Blugatroff I had a look at your pure-advent repo, and I can't repro the slow globbing - it takes about 30ms to read the globs, 250ms to read the lockfile, and 300ms to run a no-op purs build.

However, in an effort to dogfood ourselves a repo with a large amount of files, in #1197 I am adding a large number of text fixtures - this makes it easier to write tests, but it also adds a good number of directories to traverse for spago.
On that branch, I can see some globbing slowdown: it takes 400ms to glob the configs, only 150ms to read the lockfile, and about 500ms to run a no-op purs build.

400ms should be large enough to serve as a benchmark for further patches. The slow bit of code is here:

spago/src/Spago/Config.purs

Lines 177 to 187 in 455d94a

{ succeeded: otherConfigPaths, failed, ignored } <- do
result <- liftAff $ Glob.match' Paths.cwd [ "**/spago.yaml" ] { ignore: [ ".spago", "spago.yaml" ] }
-- If a file is gitignored then we don't include it as a package
let
filterGitignored path = do
Git.isIgnored path >>= case _ of
true -> pure $ Left path
false -> pure $ Right path
{ right: newSucceeded, left: ignored } <- partitionMap identity
<$> parTraverseSpago filterGitignored result.succeeded
pure { succeeded: newSucceeded, failed: result.failed, ignored }

There are two possible sources for the slowdown here:

  1. the globbing in line 178, where fast-glob is not being fast enough
  2. the check for gitignored files, where in line 182 we are calling git for every single config file that is matched by the glob

And we have a good trail we could follow:

  1. there is an issue about this in fast-glob, with mentions about using fs.walk and fs.readdir
  2. that ticket mentions another one in globby, which provides a code snippet that has a 8000% improvement over globby perf. This would solve our gitignore issue too.
  3. expanding on that, there's a whole fast implementation of globby in this PR (linked from the fast-glob ticket). I'm not sure we'll need all of that, so the snippet in the previous point might suffice

@Blugatroff
Copy link
Contributor Author

I have investigated this a bit further, i tried the code snippet you linked and it is indeed much faster.

While doing this, i also figured out which of my directories is responsible for most of the slowdown. It's my .direnv directory generated by the nix-direnv tool, it contains a bunch of symlinks to packages in my nix-store all of which get traversed by fast-glob.
Globbing only .direnv for **/spago.yaml took (on my slow laptop which triggered this issue) a glorious 2.07 seconds on first run and 575ms on subsequent runs (because of disk caching i would guess).
Globbing for **/spago.yaml using the snippet only takes ~30 ms

So ignoring gitignored directories before/while globbing instead of filtering afterwards is a huge improvement.

@f-f
Copy link
Member

f-f commented Apr 2, 2024

I have merged the PR linked above, and parsing times have improved for Spago itself as well - it now takes 150ms to glob the workspace (vs 400ms before the patch).

I have released the patch in a new version (0.93.29), could you try this out @wclr?

We could still squeeze out some more performance out of this - the ideal would be to just drop fast-glob, and instead have our own implementation that uses fs.walk and fs.readdir, and parses all the gitignore files. I'll leave this issue open to track this effort

@f-f f-f changed the title Parsing spago.lock and globbing the workspace is slow Globbing the whole workspace could be faster Apr 2, 2024
Blugatroff added a commit to Blugatroff/spago that referenced this issue Apr 5, 2024
Addresses purescript#1182

This implementation is using `micromatch` (a dependency of fast-glob)
and `fs.walk` to traverse the file system itself, instead of relying on fast-glob.
As opposed to the previous implementation (from purescript#1209) This implementation takes
every .gitignore file into account, not just the root one.
The callbacks `entryFilter` and `deepFilter` are used to control which
directories to recurse into. When `entryFilter` encounters a .gitignore
file, it's patterns are parsed into micromatch compatible ones and every
subsequent call to these filter functions respect them.

I'm not sure whether this should have been (partially) implemented in the
registry-dev repo since that's where the FFI for fast-glob is located.

I also did not remove the fast-glob npm dependency, since i'm not sure if
it is still in use elsewhere and because the registry-dev dependency
definitely still contains the foreign functions relying on it being available.
@f-f f-f closed this as completed in 56a36d5 Apr 19, 2024
@wclr
Copy link
Contributor

wclr commented Apr 29, 2024

I ran the latest version of spago against my project and got this:
image

So it (reading config step) took more than a minute.

Note, I did this test with the build from this repo, the latest published version on npm (0.93.29) differs and hangs longer on this task (I didn't wait for it).

Also, I had a glance at the latest PR with the changes, I didn't dig into it, but I just put a debug output on reading of ".gitignore" file and I got this while running spago install:
image

It is reading all the gitignore files... but ".spago" is included in the root .gitignore, so I doubt that those files should be even seen or touched, this makes me think that the latest used method of waking a project dir may be not very efficient too.

@f-f
Copy link
Member

f-f commented Apr 29, 2024

I think the first step here is to figure put - likely with a small test - if the issue here occurs only on windows, since performance on mac and linux has improved significantly and spago is not looking at those files there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants