Evaluation error in flake regression test suite #11141
One of these commits from the merge commit causes the GC to do a use-after-free: https://github.com/NixOS/nix/pull/10835/commits

We ran into a segfault on a particular dirty NixOS evaluation (ask @djacu for access). It produced random errors that pointed to a bad GC or a use-after-free: either segfaults or errors about values being of the wrong type. This suggests that the GC was freeing something it should not have, then reusing the memory, which corrupted values. Running the evaluation with GC_DONT_GC produced correct results.

Tested and bisected with @djacu and @JeremiahSecrist.

@roberth: this commit (master...djacu:nix:djacu/fix-gc-issue) fixes the error, but we're not quite sure that removing the patch was meant to have this outcome.
The sp_corrector callback is supposed to replace the old stack-correcting patch, so that the GC works properly when the stack pointer points into a coroutine stack instead of an OS-provided thread stack.

Possible workarounds:
I've tried to reproduce this by creating and checking garbage while filtering sources. While my test does trigger garbage collections during source filtering, it may not depend on the main stack getting marked.

Reproducer (work in progress):
```nix
let
  inherit (builtins)
    foldl'
    isList
    ;

  # Generate a tree of numbers, n deep, whose leaves add up to (1 + salt * 10^n) * 10^n,
  # i.e. 10^n for the salt = 0 used below.
  # The salting makes the numbers mostly different, increasing the likelihood of catching
  # any memory corruptions that might be caused by the GC or otherwise.
  garbage = salt: n:
    if n == 0
    then [(1 + salt)]
    else [
      (garbage (10 * salt + 1) (n - 1))
      (garbage (10 * salt - 1) (n - 1))
      (garbage (10 * salt + 2) (n - 1))
      (garbage (10 * salt - 2) (n - 1))
      (garbage (10 * salt + 3) (n - 1))
      (garbage (10 * salt - 3) (n - 1))
      (garbage (10 * salt + 4) (n - 1))
      (garbage (10 * salt - 4) (n - 1))
      (garbage (10 * salt + 5) (n - 1))
      (garbage (10 * salt - 5) (n - 1))
    ];

  pow = base: n:
    if n == 0
    then 1
    else base * (pow base (n - 1));

  sumNestedLists = l:
    if isList l
    then foldl' (a: b: a + sumNestedLists b) 0 l
    else l;
in
assert sumNestedLists (garbage 0 3) == pow 10 3;
assert sumNestedLists (garbage 0 6) == pow 10 6;
builtins.path {
  path = ./src;
  filter = path: type:
    # We're not doing common subexpression elimination, so this reallocates
    # the fairly big tree over and over, producing a lot of garbage during
    # source filtering, whose filter runs in a coroutine.
    assert sumNestedLists (garbage 0 4) == pow 10 4;
    true;
}
```
Reproducer:

```nix
# Run:
# GC_INITIAL_HEAP_SIZE=$[1024 * 1024] NIX_SHOW_STATS=1 nix eval -f gc-coroutine-test.nix -vvvv
let
  inherit (builtins)
    foldl'
    isList
    ;

  # Generate a tree of numbers, n deep, whose leaves add up to (1 + salt * 10^n) * 10^n,
  # i.e. 10^n for the salt = 0 used below.
  # The salting makes the numbers mostly different, increasing the likelihood of catching
  # any memory corruptions that might be caused by the GC or otherwise.
  garbage = salt: n:
    if n == 0
    then [(1 + salt)]
    else [
      (garbage (10 * salt + 1) (n - 1))
      (garbage (10 * salt - 1) (n - 1))
      (garbage (10 * salt + 2) (n - 1))
      (garbage (10 * salt - 2) (n - 1))
      (garbage (10 * salt + 3) (n - 1))
      (garbage (10 * salt - 3) (n - 1))
      (garbage (10 * salt + 4) (n - 1))
      (garbage (10 * salt - 4) (n - 1))
      (garbage (10 * salt + 5) (n - 1))
      (garbage (10 * salt - 5) (n - 1))
    ];

  pow = base: n:
    if n == 0
    then 1
    else base * (pow base (n - 1));

  sumNestedLists = l:
    if isList l
    then foldl' (a: b: a + sumNestedLists b) 0 l
    else l;
in
assert sumNestedLists (garbage 0 3) == pow 10 3;
assert sumNestedLists (garbage 0 6) == pow 10 6;
builtins.foldl'
  (a: b:
    assert
      "${
        builtins.path {
          path = ./src;
          filter = path: type:
            # We're not doing common subexpression elimination, so this reallocates
            # the fairly big tree over and over, producing a lot of garbage during
            # source filtering, whose filter runs in a coroutine.
            assert sumNestedLists (garbage 0 3) == pow 10 3;
            true;
        }
      }"
      == "${./src}";
    # These asserts don't seem necessary, as the lambda value gets corrupted first
    assert a.okay;
    assert b.okay;
    { okay = true; }
  )
  { okay = true; }
  [ { okay = true; } { okay = true; } { okay = true; } ]
```
This hopefully avoids the need for all our Boehm GC coroutine workarounds, since the GC roots will be on the main stack of the thread. Fixes NixOS#11141.
Not certain about Darwin. I only tested this on my Linux Mint machine. @tomberek and Jeremiah also tested it on their machines, but I'm not sure what OS they were using at the time.
I was using NixOS at the time.
…ector Fix issue #11141 broken stack pointer corrector
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/2024-07-22-nix-team-meeting-minutes-163/49544/1
This hopefully avoids the need for all our Boehm GC coroutine workarounds, since the GC roots will be on the main stack of the thread. Fixes NixOS#11141. (cherry picked from commit 048271e)
Describe the bug

The following test has recently started failing:

git bisect pointed to b230c01 (#11014). However, since that commit didn't change anything in the evaluator, and mysterious "attribute missing" errors are often GC-related, I ran the test script with GC_INITIAL_HEAP_SIZE=10G, and that made it succeed. So we have a GC bug somewhere.