-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multithreaded evaluator #10938
base: master
Are you sure you want to change the base?
Multithreaded evaluator #10938
Conversation
This is a mapping from paths to "resolved" paths (i.e. with `default.nix` added, if appropriate). `fileParseCache` and `fileEvalCache` are now keyed on the resolved path *only*.
Previously, the optimistic concurrency approach in `evalFile()` meant that a `nix search nixpkgs ^` would do hundreds of duplicated parsings/evaluations. Now, we reuse the thunk locking mechanism to ensure it's done only once.
This refactoring allows the symbol table to be stored as something other than std::strings.
This allows symbol IDs to be offsets into an arena whose base offset never moves, and can therefore be dereferenced without any locks.
This makes it less likely that we concurrently execute tasks that would block on a common subtask, e.g. evaluating `libfoo` and `libfoo_variant` are likely to have common dependencies.
Yeah, @tomberek, @djacu, and I were testing this last night and that was what came up to being what to graph.
I believe it was a failure but hyperfine didn't mark it as one. Rerunning the test for 51 cores gave similar numbers as 50 and 52.
Yeah, idk how nix reports errors and how hyperfine marks runs as errors. Eventually one time it did fail but adding
This error only sometimes pops up and appears to be random, only 1 run seems to get it every time I ran the complete benchmark sequence. |
That's not normal and likely caused by race conditions that this PR hasn't addressed yet. |
Yeah, that's the assumption. Unfortunately, I haven't found a common way to reliably reproduce the error. However, I have seen it at 10, 14, 32, and maybe 51 core counts. I wonder if there's some sort of commonality between those core counts which caused a race condition. |
I would expect the process to be sufficiently chaotic that the number doesn't really matter. |
I did not have errors with a script setup similar to @RossComputerGuy. For reference I am running a AMD Ryzen Threadripper 2950X 16-Core Processor. I did not increase However, when I increased the number of runs for hyperfine I started getting spurious errors. script hyperfine \
--runs 10 \
--parameter-scan num_threads 1 24 \
--export-json output.json \
--show-output \
"env NR_CORES={num_threads} GC_INITIAL_HEAP_SIZE=17G /nix/store/g4yy6s999kdbwqcjdb26rvnw18njc986-nix-2.24.0pre20240726_67ff326/bin/nix flake show --no-eval-cache --all-systems --json github:NixOS/nix/afdd12be5e19c0001ff3297dea544301108d298" errors SPURIOUS 0x7f7659435200
SPURIOUS 0x7f7659435200
SPURIOUS 0x7f7659435200
SPURIOUS 0x7f7659435200
SPURIOUS 0x7f77e7cd63a0
SPURIOUS 0x7f7659418700
error (ignored): error:
… in the left operand of the update (//) operator
at /nix/store/8ghgpja4h0rpxhhcldv94hzsr923bl8n-source/tests/nixos/default.nix:19:5:
18| })
19| // {
| ^
20| # allow running tests against older nix versions via `nix eval --apply`
… while evaluating the attribute 'value'
at /nix/store/i4nv0mdcx8iifh3r71qd0pbp8al8kp1z-source/lib/modules.nix:809:9:
808| in warnDeprecation opt //
809| { value = builtins.addErrorContext "while evaluating the option `${showOption loc}':" value;
| ^
810| inherit (res.defsFinal') highestPrio;
… while evaluating the option `result':
(stack trace truncated; use '--show-trace' to show the full, detailed trace)
error: expected a list but found the partially applied built-in function 'map': «partially applied primop map» |
I see lines like $ cat output.json | jq -r '.results | .[] | [.parameters.num_threads, .median] | @tsv' | uplot lineplot
┌─────────────────────────────────────────────────┐
40 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠱⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠑⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠘⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠈⢆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠈⠑⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠑⠢⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠢⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠢⢄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠑⠢⠤⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠒⠒⠢⠤⠤⠤⠤⢄⣀⣀⡀⠀⠀⠀⠀⠀⠀│
10 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠑⠒⠒⠒│
└─────────────────────────────────────────────────┘
1 9 $ cat output.json | jq -r '.results | .[] | [.parameters.num_threads, .median] | @tsv' | uplot barplot
┌ ┐
1 ┤■■■■■■■■■■■■■■■■■■■■ 32.359621253680004
2 ┤■■■■■■■■■■■■■■■ 23.72844723518
3 ┤■■■■■■■■■■■■ 19.23701058818
4 ┤■■■■■■■■■■ 15.88976078368
5 ┤■■■■■■■■■ 14.07459731268
6 ┤■■■■■■■■ 13.020973145180001
7 ┤■■■■■■■■ 12.56286847868
8 ┤■■■■■■■ 11.62728899418
9 ┤■■■■■■■ 11.13275702618
└ ┘ |
I found that 2 runs isn't sufficient to produce statistically meaningful results and occasionally a there would be runs that have wildly different times. This is why I tried to go for 10 runs. It would be nice if the output data could output $ cat process.sh
cat output.json \
| jq -r '
.results
| .[0].user as $first_user
| .[0].system as $first_system
| .[]
| [.parameters.num_threads, (.user + .system) / ($first_user + $first_system)]
| @tsv
' \
| uplot lineplot $ ./process.sh
┌─────────────────────────────────────────────────┐
3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⠤⠤⠔⠊⠉⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⠤⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠤⠔⠒⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠤⠤⠒⠊⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⣀⠤⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⡠⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
1 │⡠⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
└─────────────────────────────────────────────────┘
1 9 switched to ┌ ┐
1 ┤■■■■■■■■■■ 1.0
2 ┤■■■■■■■■■■■■■ 1.339834980540174
3 ┤■■■■■■■■■■■■■■■ 1.5154225766675506
4 ┤■■■■■■■■■■■■■■■■ 1.6041827253996215
5 ┤■■■■■■■■■■■■■■■■■ 1.7081109410792572
6 ┤■■■■■■■■■■■■■■■■■■ 1.8005707699572884
7 ┤■■■■■■■■■■■■■■■■■■■ 1.9191750807373242
8 ┤■■■■■■■■■■■■■■■■■■■■ 1.9985132513328854
9 ┤■■■■■■■■■■■■■■■■■■■■■ 2.121231560486825
└ ┘ |
We should never call reset() on a value (such as vRes) than can be seen by another thread. This was causing random failures about 'partially applied built-in function' etc.
@djacu @RossComputerGuy The random "partially applied built-in function" failures should be fixed now (839aec2). |
Sweet, thank you for fixing that. I'll take a look tonight. |
@edolstra updated numbers $ cat bench.sh
hyperfine \
--runs 3 \
--parameter-scan num_threads 1 24 \
--export-json output-2024-08-13-runs3-threads1-24.json \
--show-output \
"env NR_CORES={num_threads} GC_INITIAL_HEAP_SIZE=17G /nix/store/cvfbjr9kw0piqd8w740z9s2aplr2xbsp-nix-2.25.0pre20240813_d36ea2e/bin/nix flake show --no-eval-cache --all-systems --json github:NixOS/nix/afdd12be5e19c0001ff3297dea544301108d298" $ cat process.sh
cat $1 \
| jq -r '
.results
| .[0].user as $first_user
| .[0].system as $first_system
| .[]
| [.parameters.num_threads, (.user + .system) / ($first_user + $first_system)]
| @tsv
' \
| uplot $2 $ ./process.sh output-2024-08-13-runs3-threads1-24.json lineplot
┌────────────────────────────────────────┐
4 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⢄⡀⣀⣀⡠⠊⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⠢⠤⠒⠁⠀⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⡔⠒⠒⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡜⠈⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠚⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⡠⠔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⡤⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⡠⠎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⡜⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
1 │⠀⣰⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
└────────────────────────────────────────┘
0 30
$ cat process1.sh
cat $1 \
| jq -r '
.results
| .[]
| [.parameters.num_threads, .median]
| @tsv
' \
| uplot $2
|
Ran with the same exact setup as before except
|
Better illustration of the performance increase per additional thread $ cat process_time.sh
cat $1 \
| jq -r '
.results
| [.[].median] as $median
| [ $median[:-1], $median[1:] ]
| transpose
| map( .[0] - .[1] )
' \
| uplot $2 \
--title 'Absolute Time Improvement'
cat $1 \
| jq -r '
.results
| [.[].median] as $median
| [ $median[:-1], $median[1:] ]
| transpose
| map( ( .[0] - .[1] ) / .[0] * 100 )
' \
| uplot $2 \
--title '% Time Improvement'
|
This fixes a crash in Printer if a value is in a failed state.
Relevant paper with a design that implements this.
It's quite a different architecture, but maybe we need that? Specifically, they check blackholes on the stack during GC. I guess the alternative is to add more bookkeeping so that perhaps another thread could be tasked with periodically checking that no thread is deadlocked, and if they are, insert exception thunks, unblock them, and let them unwind their stacks. Fwiw, they have
It sure would be nice not to regress single-threaded performance, and have a boost in multi-threaded as well. I wouldn't oppose taking control of the stack instead of delegating all of that completely to bdwgc. Doing so would open the door to other benefits, such as more efficient |
Motivation
This PR makes the evaluator thread-safe. Currently, only
nix flake search
andnix flake show
make use of multi-threaded evaluation to achieve a speedup on multicore systems.Unlike the previous attempt at a multi-threaded evaluator, this one locks thunks to prevent them from being evaluated more than once. The life of a thunk is now:
forceValue()
call, the thunk type goes fromtThunk
totPending
.forceValue()
on a thunk in thetPending
state, it acquires a lock to register itself as "awaiting" that value, and sets the type totAwaited
.tAwaited
, it updates the value and wakes up the threads that are waiting. If the type istPending
, it just updates the value normally.Also, there now is a
tFailed
value type that stores an exception pointer to represent the case where thunk evaluation throws an exception. In that case, every thread that forces the thunk should get the same exception.To enable multi-threaded evaluation, you need to set the
NR_CORES
environment variable to the number of threads to use. You can also setNIX_SHOW_THREAD_STATS=1
to get some debug statistics.Some benchmark results on a Ryzen 5900X with 12 cores and 24 hyper-threads:
NR_CORES=12 GC_INITIAL_HEAP_SIZE=8G nix flake show --no-eval-cache --all-systems --json github:NixOS/nix/afdd12be5e19c0001ff3297dea544301108d298
went from 23.70s to 5.77s.NR_CORES=16 GC_INITIAL_HEAP_SIZE=6G time nix search --no-eval-cache github:NixOS/nixpkgs/bf8462aeba50cc753971480f613fbae0747cffc0?narHash=sha256-bPyv7hsbtuxyL6LLKtOYL6QsmPeFWP839BZQMd3RoUg%3D ^
went from 11.82s to 3.88s.Note: it's good to set
GC_INITIAL_HEAP_SIZE
to a high value because stop-the-world garbage collection is expensive.To do:
nix flake check
.Executor
class currently executes work items in random order to reduce the probability that we execute a bunch of items at the same time that all depend on the same thunk, causing all but one to be blocked. This can probably be improved.Context
Priorities and Process
Add 👍 to pull requests you find important.
The Nix maintainer team uses a GitHub project board to schedule and track reviews.