Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On a machine with many cores, snapshotting large repos is very CPU-intensive #4508

Open
mindajar opened this issue Sep 20, 2024 · 7 comments
Labels
🐛bug Something isn't working 🍎Mac

Comments

@mindajar
Copy link

Description

On a machine with many cores, snapshotting large repos is very CPU-intensive

Steps to Reproduce the Problem

  1. Check out a very large repo (here, n = ~150,000 files) on a machine with many cores (here, n = 24)
  2. Run jj st (or just jj) and measure the time it takes for the command to complete.
  3. export RAYON_NUM_THREADS=4
  4. Repeat step 2

Expected Behavior

Similar performance in both cases.

Actual Behavior

jj's default behavior:
2.28 real 0.34 user 32.26 sys
jj limited to four threads:
1.13 real 0.16 user 0.49 sys

This one took some doing and profiling to figure out, as it didn't immediately make sense that the same working copy is so much faster to work with on a much smaller machine.

Specifications

  • Platform: macOS
  • Version: 14, 15
@arxanas
Copy link
Contributor

arxanas commented Sep 20, 2024

Interesting. Can't look now, but it's possible we serialize all tree updates into a single channel and it's contending when there are more threads. It would be in the working copy snapshot code somewhere.

  • One fix could be to batch the channel sends.
  • Another idea could be to use Rayon's parallel iterator bridge to consume the channel contents (which would do batching/work stealing).
  • It might work to do an explicit map/reduce across worker threads, although I don't recall if Rayon easily supports this pattern.

To fix your immediate issue, you can also try enabling the Watchman fsmonitor.

@yuja
Copy link
Contributor

yuja commented Sep 20, 2024

I heard (iirc when I was working on Mercurial) that it's sometimes faster to scan directory entries sequentially than splitting jobs to worker processes, which tends to lead to random access. I don't know this is the case, though.

@mindajar
Copy link
Author

Not even a little bit urgent -- I was mostly bewildered at what I could possibly have broken on the big machine to make jj status peg all CPUs, and I couldn't stop poking at it until I figured it out :)

(watchman is currently broken in MacPorts, which is how I ended up here)

@PhilipMetzger PhilipMetzger added the 🐛bug Something isn't working label Sep 20, 2024
@thoughtpolice
Copy link
Member

thoughtpolice commented Sep 20, 2024

I can't reproduce this on my 32 core Zen 1 machine (Linux 6.10) with gecko-dev, which is ~1mil commits and 380k working set files. In fact it never gets slower, but it's nowhere close to linear speedup; 32 cores is only ~2x faster than 4 cores (1.20s vs 0.7s). Would you be willing to try this with a repository like gecko-dev and report back to see what it says? It would make it easier for baseline comparisons, at least.

I suspect two things:

  • macOS. As we all know, it's an operating system that is different from Linux.
  • Hardware. You're probably using one of those Mac Studios? I'm guessing it's the 16P+8E configuration. Trying to do this across every core probably has some consequence. I don't remember how to do this but there are ways to force apps to use specific subsets of cores. Would you be willing to test a few combinations of threads on different cores with various RAYON_NUM_THREADS? e.g. 16 P cores versus 8 P cores versus 8 E cores?

I don't have a Studio but I do have a M2 Air, which coincidentally dual boots Fedora. So, if I get a chance I can see how it all shakes out on both systems, Linux vs macOS, but it's only 4P+4E, so it's not going to be as big a deal I suspect.

If it turns out that some other core configuration gives big improvements we can probably make a change to the scheduling policy somehow before we use Rayon so jj sticks to the right settings; a blunt hammer can then be applied with some patch to achieve that.


Note that I couldn't reliably clone gecko-dev from GitHub in one-go due to network errors, so I had to clone a 1-height shallow repo and then 'saturate it' by unshallowing for it to work:

git clone https://github.com/mozilla/gecko-dev --depth 1
cd gecko-dev
git fetch --unshallow
jj git init --colocate

@mindajar
Copy link
Author

Yes, this is a 16P+8E Mac Studio.

I noticed while testing this that OS caches seem to get evicted pretty quickly; after not that many seconds, a re-run is noticeably slower. I don't understand why, but thought it was interesting.

I've not figured out how to control QoS to the degree you describe, but taskpolicy(8) offers some coarse-grained control. A (lightly edited for readability) transcript:

gecko-dev % jj version
jj 0.21.0-ac605d2e7bc71e462515f8c423fbc0437f18b363
gecko-dev % jj st
The working copy is clean
Working copy : snzmnpzp 24928d98 (empty) (no description set)
Parent commit: srvlssxw 50498861 master | Backed out changeset 58983adca2f1 (bug 1916328) for causing dt failures @ browser_parsable_css.js
gecko-dev % jj file list | wc -l
  373973
gecko-dev % echo $RAYON_NUM_THREADS

gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      4.099 s ±  0.517 s    [User: 2.834 s, System: 55.348 s]
  Range (min … max):    3.697 s …  5.237 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      6.803 s ±  0.418 s    [User: 5.987 s, System: 38.212 s]
  Range (min … max):    6.267 s …  7.599 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      6.938 s ±  0.431 s    [User: 6.578 s, System: 49.789 s]
  Range (min … max):    6.014 s …  7.399 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      4.249 s ±  0.371 s    [User: 2.839 s, System: 58.087 s]
  Range (min … max):    3.853 s …  5.065 s    10 runs

gecko-dev % export RAYON_NUM_THREADS=8
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      2.341 s ±  0.018 s    [User: 1.710 s, System: 9.140 s]
  Range (min … max):    2.319 s …  2.376 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      6.951 s ±  0.447 s    [User: 5.700 s, System: 27.704 s]
  Range (min … max):    6.319 s …  7.838 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      7.003 s ±  0.786 s    [User: 5.561 s, System: 27.330 s]
  Range (min … max):    5.456 s …  8.334 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      2.567 s ±  0.110 s    [User: 1.731 s, System: 9.194 s]
  Range (min … max):    2.366 s …  2.692 s    10 runs

gecko-dev % export RAYON_NUM_THREADS=4
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      3.232 s ±  0.279 s    [User: 1.427 s, System: 5.208 s]
  Range (min … max):    2.951 s …  3.898 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      9.691 s ±  0.729 s    [User: 5.024 s, System: 21.260 s]
  Range (min … max):    7.840 s … 10.256 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      9.670 s ±  0.735 s    [User: 4.990 s, System: 21.110 s]
  Range (min … max):    8.341 s … 10.393 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      3.784 s ±  0.211 s    [User: 1.476 s, System: 5.713 s]
  Range (min … max):    3.454 s …  4.170 s    10 runs

@jfchevrette
Copy link

jfchevrette commented Nov 26, 2024

I'm seeing the same issue on my M3 Pro mac

The repo I'm testing with is https://github.com/NixOS/nixpkgs/

$ jj version
jj 0.23.0

$ sw_vers
ProductName:            macOS
ProductVersion:         15.1
BuildVersion:           24B83

$ sysctl -n machdep.cpu.brand_string
Apple M3 Pro

$ echo $RAYON_NUM_THREADS

$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      1.799 s ±  0.519 s    [User: 0.564 s, System: 17.117 s]
  Range (min … max):    1.233 s …  2.600 s    10 runs

$ export RAYON_NUM_THREADS=8
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     776.6 ms ±  33.5 ms    [User: 453.1 ms, System: 4901.4 ms]
  Range (min … max):   748.0 ms … 856.7 ms    10 runs

$ export RAYON_NUM_THREADS=4
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     454.3 ms ±   9.3 ms    [User: 357.5 ms, System: 1099.9 ms]
  Range (min … max):   445.3 ms … 475.3 ms    10 runs

$ export RAYON_NUM_THREADS=2
$ hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):     679.7 ms ±  14.9 ms    [User: 332.4 ms, System: 908.6 ms]
  Range (min … max):   649.3 ms … 704.0 ms    10 runs

yuja added a commit to yuja/jj that referenced this issue Dec 5, 2024
This change basically means two things:
 a. a directory scan isn't split into too many small jobs, and
 b. a directory scan isn't blocked by recursive visit_directory() calls.
Before, small jobs were created at each recursion depth, so there were silent
time slice before these jobs started producing work.

I don't know if this mitigates the issue jj-vcs#4508, but it's slightly faster on my
Linux machine (except for the "git" repo which doesn't have many directories to
parallelize jobs.) The implementation is also simpler.

matcher.visit(dir) is moved to caller because it's silly to spawn an empty job.
TreeState::snapshot() already checks that for the root path.

This change will allow us to build a set of deleted files per directory, which
will reduce the overhead to send "present" files over the channel.

Benchmark:
- jj-0: original
- jj-1: this patch

gecko-dev (large):
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1,jj-2 \
  'target/release-with-debug/{bin} -R ~/mirrors/gecko-dev debug snapshot'
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     707.4 ms ±  12.3 ms    [User: 2880.2 ms, System: 1939.1 ms]
  Range (min … max):   684.4 ms … 721.7 ms    10 runs

Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     659.6 ms ±  11.8 ms    [User: 2673.2 ms, System: 1882.1 ms]
  Range (min … max):   649.2 ms … 690.1 ms    10 runs

Relative speed comparison
        1.61 ±  0.05  target/release-with-debug/jj-0 -R ~/mirrors/gecko-dev debug snapshot
        1.50 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
```

linux (mid-size):
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1,jj-2 \
  'target/release-with-debug/{bin} -R ~/mirrors/linux debug snapshot'
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     256.3 ms ±   6.2 ms    [User: 573.0 ms, System: 459.2 ms]
  Range (min … max):   248.5 ms … 266.0 ms    10 runs

Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     225.5 ms ±   4.2 ms    [User: 563.3 ms, System: 475.8 ms]
  Range (min … max):   218.7 ms … 231.4 ms    10 runs

Relative speed comparison
        1.35 ±  0.04  target/release-with-debug/jj-0 -R ~/mirrors/linux debug snapshot
        1.19 ±  0.03  target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
```

git (small, flat):
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 100 -L bin jj-0,jj-1,jj-2 \
  'target/release-with-debug/{bin} -R ~/mirrors/git debug snapshot'
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.6 ms ±   1.6 ms    [User: 38.8 ms, System: 40.6 ms]
  Range (min … max):    28.1 ms …  43.8 ms    100 runs

Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      34.0 ms ±   1.2 ms    [User: 33.5 ms, System: 47.5 ms]
  Range (min … max):    31.5 ms …  39.2 ms    100 runs

Relative speed comparison
        1.00          target/release-with-debug/jj-0 -R ~/mirrors/git debug snapshot
        1.11 ±  0.07  target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
```

jj (small, structured):
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 100 -L bin jj-0,jj-1,jj-2 \
  'target/release-with-debug/{bin} -R . debug snapshot'
Benchmark 1: target/release-with-debug/jj-0 -R . debug snapshot
  Time (mean ± σ):      30.9 ms ±   1.6 ms    [User: 16.3 ms, System: 28.0 ms]
  Range (min … max):    28.9 ms …  41.8 ms    100 runs

Benchmark 2: target/release-with-debug/jj-1 -R . debug snapshot
  Time (mean ± σ):      31.2 ms ±   1.9 ms    [User: 15.3 ms, System: 27.0 ms]
  Range (min … max):    29.0 ms …  43.3 ms    100 runs

Relative speed comparison
        1.00          target/release-with-debug/jj-0 -R . debug snapshot
        1.01 ±  0.08  target/release-with-debug/jj-1 -R . debug snapshot
```

- CPU: 8-core AMD Ryzen 7 PRO 4750U with Radeon Graphics (-MT MCP-)
- speed/min/max: 1600/1400/1700 MHz Kernel: 6.11.10-amd64 x86_64
- Filesystem: ext4
yuja added a commit to yuja/jj that referenced this issue Dec 6, 2024
…threshold (WIP)

This change basically means two things:
 a. a directory scan isn't split into too many small jobs, and
 b. a directory scan isn't blocked by recursive visit_directory() calls.
Before, small jobs were created at each recursion depth, so there were silent
time slice before these jobs started producing work.

I don't know if this mitigates the issue jj-vcs#4508, but it's slightly faster on my
Linux machine (except for the "git" repo which doesn't have many directories to
parallelize jobs.) The implementation is also simpler.

matcher.visit(dir) is moved to caller because it's silly to spawn an empty job.
TreeState::snapshot() already checks that for the root path.

This change will allow us to build a set of deleted files per directory, which
will reduce the overhead to send "present" files over the channel.

TODO: update benchmark
Benchmark:
- jj-0: original
- jj-1: this patch

gecko-dev (large):
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1,jj-2 \
  'target/release-with-debug/{bin} -R ~/mirrors/gecko-dev debug snapshot'
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     707.4 ms ±  12.3 ms    [User: 2880.2 ms, System: 1939.1 ms]
  Range (min … max):   684.4 ms … 721.7 ms    10 runs

Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     659.6 ms ±  11.8 ms    [User: 2673.2 ms, System: 1882.1 ms]
  Range (min … max):   649.2 ms … 690.1 ms    10 runs

Relative speed comparison
        1.61 ±  0.05  target/release-with-debug/jj-0 -R ~/mirrors/gecko-dev debug snapshot
        1.50 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
```

linux (mid-size):
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 10 -L bin jj-0,jj-1,jj-2 \
  'target/release-with-debug/{bin} -R ~/mirrors/linux debug snapshot'
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     256.3 ms ±   6.2 ms    [User: 573.0 ms, System: 459.2 ms]
  Range (min … max):   248.5 ms … 266.0 ms    10 runs

Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     225.5 ms ±   4.2 ms    [User: 563.3 ms, System: 475.8 ms]
  Range (min … max):   218.7 ms … 231.4 ms    10 runs

Relative speed comparison
        1.35 ±  0.04  target/release-with-debug/jj-0 -R ~/mirrors/linux debug snapshot
        1.19 ±  0.03  target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
```

git (small, flat):
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 100 -L bin jj-0,jj-1,jj-2 \
  'target/release-with-debug/{bin} -R ~/mirrors/git debug snapshot'
Benchmark 1: target/release-with-debug/jj-0 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.6 ms ±   1.6 ms    [User: 38.8 ms, System: 40.6 ms]
  Range (min … max):    28.1 ms …  43.8 ms    100 runs

Benchmark 2: target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      34.0 ms ±   1.2 ms    [User: 33.5 ms, System: 47.5 ms]
  Range (min … max):    31.5 ms …  39.2 ms    100 runs

Relative speed comparison
        1.00          target/release-with-debug/jj-0 -R ~/mirrors/git debug snapshot
        1.11 ±  0.07  target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
```

jj (small, structured):
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 100 -L bin jj-0,jj-1,jj-2 \
  'target/release-with-debug/{bin} -R . debug snapshot'
Benchmark 1: target/release-with-debug/jj-0 -R . debug snapshot
  Time (mean ± σ):      30.9 ms ±   1.6 ms    [User: 16.3 ms, System: 28.0 ms]
  Range (min … max):    28.9 ms …  41.8 ms    100 runs

Benchmark 2: target/release-with-debug/jj-1 -R . debug snapshot
  Time (mean ± σ):      31.2 ms ±   1.9 ms    [User: 15.3 ms, System: 27.0 ms]
  Range (min … max):    29.0 ms …  43.3 ms    100 runs

Relative speed comparison
        1.00          target/release-with-debug/jj-0 -R . debug snapshot
        1.01 ±  0.08  target/release-with-debug/jj-1 -R . debug snapshot
```

- CPU: 8-core AMD Ryzen 7 PRO 4750U with Radeon Graphics (-MT MCP-)
- speed/min/max: 1600/1400/1700 MHz Kernel: 6.11.10-amd64 x86_64
- Filesystem: ext4
yuja added a commit to yuja/jj that referenced this issue Dec 7, 2024
…threshold

This change basically means two things:
 a. a directory scan isn't split into too many small jobs, and
 b. a directory scan isn't blocked by recursive visit_directory() calls.
Before, small jobs were created at each recursion depth, so there were silent
time slice before these jobs started producing work.

I don't know if this mitigates the issue jj-vcs#4508, but it's slightly faster on my
Linux machine.

matcher.visit(dir) is moved to caller because it's silly to spawn an empty job.
TreeState::snapshot() already checks that for the root path.

Benchmark:
1. original
2. per-directory spawn (this patch)
3. per-directory deleted files (omitted)
4. shorter path comparison (omitted)

gecko-dev (~357k files, ~25k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     764.9 ms ±  16.7 ms    [User: 3274.7 ms, System: 2183.3 ms]
  Range (min … max):   731.9 ms … 814.2 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     710.7 ms ±   9.1 ms    [User: 3070.7 ms, System: 2142.6 ms]
  Range (min … max):   695.9 ms … 740.1 ms    30 runs

Relative speed comparison
        1.89 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
        1.76 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/gecko-dev debug snapshot
```

linux (~87k files, ~6k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     268.2 ms ±  11.3 ms    [User: 636.6 ms, System: 518.5 ms]
  Range (min … max):   247.5 ms … 295.2 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     242.3 ms ±   3.3 ms    [User: 656.8 ms, System: 538.0 ms]
  Range (min … max):   236.9 ms … 252.3 ms    30 runs

Relative speed comparison
        1.40 ±  0.06  target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
        1.27 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/linux debug snapshot
```

nixpkgs (~45k files, ~31k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/nixpkgs debug snapshot
  Time (mean ± σ):     201.0 ms ±   8.5 ms    [User: 929.3 ms, System: 917.6 ms]
  Range (min … max):   170.3 ms … 218.5 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/nixpkgs debug snapshot
  Time (mean ± σ):     190.7 ms ±   4.1 ms    [User: 859.3 ms, System: 881.1 ms]
  Range (min … max):   184.6 ms … 202.4 ms    30 runs

Relative speed comparison
        1.24 ±  0.06  target/release-with-debug/jj-1 -R ~/mirrors/nixpkgs debug snapshot
        1.18 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/nixpkgs debug snapshot
```

git (~4.5k files, 0.2k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 30 --runs 50 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.3 ms ±   1.1 ms    [User: 40.5 ms, System: 39.4 ms]
  Range (min … max):    28.3 ms …  35.7 ms    50 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.6 ms ±   1.1 ms    [User: 33.8 ms, System: 39.0 ms]
  Range (min … max):    29.0 ms …  35.0 ms    50 runs

Relative speed comparison
        1.05 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
        1.06 ±  0.05  target/release-with-debug/jj-2 -R ~/mirrors/git debug snapshot
```

- CPU: 8-core AMD Ryzen 7 PRO 4750U with Radeon Graphics (-MT MCP-)
- speed/min/max: 1600/1400/1700 MHz Kernel: 6.11.10-amd64 x86_64
- Filesystem: ext4
yuja added a commit to yuja/jj that referenced this issue Dec 9, 2024
…threshold

This change basically means two things:
 a. a directory scan isn't split into too many small jobs, and
 b. a directory scan isn't blocked by recursive visit_directory() calls.
Before, small jobs were created at each recursion depth, so there were silent
time slice before these jobs started producing work.

I don't know if this mitigates the issue jj-vcs#4508, but it's slightly faster on my
Linux machine.

matcher.visit(dir) is moved to caller because it's silly to spawn an empty job.
TreeState::snapshot() already checks that for the root path.

Benchmark:
1. original
2. per-directory spawn (this patch)
3. per-directory deleted files (omitted)
4. shorter path comparison (omitted)

gecko-dev (~357k files, ~25k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     764.9 ms ±  16.7 ms    [User: 3274.7 ms, System: 2183.3 ms]
  Range (min … max):   731.9 ms … 814.2 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     710.7 ms ±   9.1 ms    [User: 3070.7 ms, System: 2142.6 ms]
  Range (min … max):   695.9 ms … 740.1 ms    30 runs

Relative speed comparison
        1.89 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
        1.76 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/gecko-dev debug snapshot
```

linux (~87k files, ~6k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     268.2 ms ±  11.3 ms    [User: 636.6 ms, System: 518.5 ms]
  Range (min … max):   247.5 ms … 295.2 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     242.3 ms ±   3.3 ms    [User: 656.8 ms, System: 538.0 ms]
  Range (min … max):   236.9 ms … 252.3 ms    30 runs

Relative speed comparison
        1.40 ±  0.06  target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
        1.27 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/linux debug snapshot
```

nixpkgs (~45k files, ~31k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/nixpkgs debug snapshot
  Time (mean ± σ):     201.0 ms ±   8.5 ms    [User: 929.3 ms, System: 917.6 ms]
  Range (min … max):   170.3 ms … 218.5 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/nixpkgs debug snapshot
  Time (mean ± σ):     190.7 ms ±   4.1 ms    [User: 859.3 ms, System: 881.1 ms]
  Range (min … max):   184.6 ms … 202.4 ms    30 runs

Relative speed comparison
        1.24 ±  0.06  target/release-with-debug/jj-1 -R ~/mirrors/nixpkgs debug snapshot
        1.18 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/nixpkgs debug snapshot
```

git (~4.5k files, 0.2k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 30 --runs 50 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.3 ms ±   1.1 ms    [User: 40.5 ms, System: 39.4 ms]
  Range (min … max):    28.3 ms …  35.7 ms    50 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.6 ms ±   1.1 ms    [User: 33.8 ms, System: 39.0 ms]
  Range (min … max):    29.0 ms …  35.0 ms    50 runs

Relative speed comparison
        1.05 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
        1.06 ±  0.05  target/release-with-debug/jj-2 -R ~/mirrors/git debug snapshot
```

- CPU: 8-core AMD Ryzen 7 PRO 4750U with Radeon Graphics (-MT MCP-)
- speed/min/max: 1600/1400/1700 MHz Kernel: 6.11.10-amd64 x86_64
- Filesystem: ext4
yuja added a commit to yuja/jj that referenced this issue Dec 10, 2024
…threshold

This change basically means two things:
 a. a directory scan isn't split into too many small jobs, and
 b. a directory scan isn't blocked by recursive visit_directory() calls.
Before, small jobs were created at each recursion depth, so there were silent
time slice before these jobs started producing work.

I don't know if this mitigates the issue jj-vcs#4508, but it's slightly faster on my
Linux machine.

matcher.visit(dir) is moved to caller because it's silly to spawn an empty job.
TreeState::snapshot() already checks that for the root path.

Benchmark:
1. original
2. per-directory spawn (this patch)
3. per-directory deleted files (omitted)
4. shorter path comparison (omitted)

gecko-dev (~357k files, ~25k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     764.9 ms ±  16.7 ms    [User: 3274.7 ms, System: 2183.3 ms]
  Range (min … max):   731.9 ms … 814.2 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     710.7 ms ±   9.1 ms    [User: 3070.7 ms, System: 2142.6 ms]
  Range (min … max):   695.9 ms … 740.1 ms    30 runs

Relative speed comparison
        1.89 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
        1.76 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/gecko-dev debug snapshot
```

linux (~87k files, ~6k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     268.2 ms ±  11.3 ms    [User: 636.6 ms, System: 518.5 ms]
  Range (min … max):   247.5 ms … 295.2 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     242.3 ms ±   3.3 ms    [User: 656.8 ms, System: 538.0 ms]
  Range (min … max):   236.9 ms … 252.3 ms    30 runs

Relative speed comparison
        1.40 ±  0.06  target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
        1.27 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/linux debug snapshot
```

nixpkgs (~45k files, ~31k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/nixpkgs debug snapshot
  Time (mean ± σ):     201.0 ms ±   8.5 ms    [User: 929.3 ms, System: 917.6 ms]
  Range (min … max):   170.3 ms … 218.5 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/nixpkgs debug snapshot
  Time (mean ± σ):     190.7 ms ±   4.1 ms    [User: 859.3 ms, System: 881.1 ms]
  Range (min … max):   184.6 ms … 202.4 ms    30 runs

Relative speed comparison
        1.24 ±  0.06  target/release-with-debug/jj-1 -R ~/mirrors/nixpkgs debug snapshot
        1.18 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/nixpkgs debug snapshot
```

git (~4.5k files, 0.2k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 30 --runs 50 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.3 ms ±   1.1 ms    [User: 40.5 ms, System: 39.4 ms]
  Range (min … max):    28.3 ms …  35.7 ms    50 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.6 ms ±   1.1 ms    [User: 33.8 ms, System: 39.0 ms]
  Range (min … max):    29.0 ms …  35.0 ms    50 runs

Relative speed comparison
        1.05 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
        1.06 ±  0.05  target/release-with-debug/jj-2 -R ~/mirrors/git debug snapshot
```

- CPU: 8-core AMD Ryzen 7 PRO 4750U with Radeon Graphics (-MT MCP-)
- speed/min/max: 1600/1400/1700 MHz Kernel: 6.11.10-amd64 x86_64
- Filesystem: ext4
yuja added a commit that referenced this issue Dec 10, 2024
…threshold

This change basically means two things:
 a. a directory scan isn't split into too many small jobs, and
 b. a directory scan isn't blocked by recursive visit_directory() calls.
Before, small jobs were created at each recursion depth, so there were silent
time slice before these jobs started producing work.

I don't know if this mitigates the issue #4508, but it's slightly faster on my
Linux machine.

matcher.visit(dir) is moved to caller because it's silly to spawn an empty job.
TreeState::snapshot() already checks that for the root path.

Benchmark:
1. original
2. per-directory spawn (this patch)
3. per-directory deleted files (omitted)
4. shorter path comparison (omitted)

gecko-dev (~357k files, ~25k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     764.9 ms ±  16.7 ms    [User: 3274.7 ms, System: 2183.3 ms]
  Range (min … max):   731.9 ms … 814.2 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/gecko-dev debug snapshot
  Time (mean ± σ):     710.7 ms ±   9.1 ms    [User: 3070.7 ms, System: 2142.6 ms]
  Range (min … max):   695.9 ms … 740.1 ms    30 runs

Relative speed comparison
        1.89 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/gecko-dev debug snapshot
        1.76 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/gecko-dev debug snapshot
```

linux (~87k files, ~6k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     268.2 ms ±  11.3 ms    [User: 636.6 ms, System: 518.5 ms]
  Range (min … max):   247.5 ms … 295.2 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/linux debug snapshot
  Time (mean ± σ):     242.3 ms ±   3.3 ms    [User: 656.8 ms, System: 538.0 ms]
  Range (min … max):   236.9 ms … 252.3 ms    30 runs

Relative speed comparison
        1.40 ±  0.06  target/release-with-debug/jj-1 -R ~/mirrors/linux debug snapshot
        1.27 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/linux debug snapshot
```

nixpkgs (~45k files, ~31k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 3 --runs 30 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/nixpkgs debug snapshot
  Time (mean ± σ):     201.0 ms ±   8.5 ms    [User: 929.3 ms, System: 917.6 ms]
  Range (min … max):   170.3 ms … 218.5 ms    30 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/nixpkgs debug snapshot
  Time (mean ± σ):     190.7 ms ±   4.1 ms    [User: 859.3 ms, System: 881.1 ms]
  Range (min … max):   184.6 ms … 202.4 ms    30 runs

Relative speed comparison
        1.24 ±  0.06  target/release-with-debug/jj-1 -R ~/mirrors/nixpkgs debug snapshot
        1.18 ±  0.03  target/release-with-debug/jj-2 -R ~/mirrors/nixpkgs debug snapshot
```

git (~4.5k files, 0.2k dirs)
```
% JJ_CONFIG=/dev/null hyperfine --sort command --warmup 30 --runs 50 ..
Benchmark 1: target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.3 ms ±   1.1 ms    [User: 40.5 ms, System: 39.4 ms]
  Range (min … max):    28.3 ms …  35.7 ms    50 runs

Benchmark 2: target/release-with-debug/jj-2 -R ~/mirrors/git debug snapshot
  Time (mean ± σ):      30.6 ms ±   1.1 ms    [User: 33.8 ms, System: 39.0 ms]
  Range (min … max):    29.0 ms …  35.0 ms    50 runs

Relative speed comparison
        1.05 ±  0.05  target/release-with-debug/jj-1 -R ~/mirrors/git debug snapshot
        1.06 ±  0.05  target/release-with-debug/jj-2 -R ~/mirrors/git debug snapshot
```

- CPU: 8-core AMD Ryzen 7 PRO 4750U with Radeon Graphics (-MT MCP-)
- speed/min/max: 1600/1400/1700 MHz Kernel: 6.11.10-amd64 x86_64
- Filesystem: ext4
@yuja
Copy link
Contributor

yuja commented Dec 10, 2024

Some of these changes might fix or mitigate the problem:
27c2882, 99d8703, 8caec18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐛bug Something isn't working 🍎Mac
Projects
None yet
Development

No branches or pull requests

6 participants