disable unittest on FreeBSD that is spuriously stalling the CI pipelines #10876
Conversation
Thanks for your pull request, @rainers!

Bugzilla references: Your PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally: If you don't have a local development environment set up, you can use Digger to test this PR:
dub run digger -- build "master + phobos#10876"
Force-pushed from 076a331 to 752a682.
thewilsonator left a comment:
Otherwise looks good
Force-pushed from 752a682 to 274b8e7.
The failure on Alpine looks pretty similar here. Is that platform detectable, too? The recent macOS timeout failures seem different; those just seem to run on slower machines sometimes.
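For the detectability question: a hedged note, since Alpine itself has no dedicated predefined version identifier in D, but its C runtime (musl) does, so a compile-time check along these lines should work if the test also needs to be relaxed there.

```d
// Hedged sketch: CRuntime_Musl is the predefined version identifier set for
// musl-based platforms such as Alpine; there is no "Alpine" identifier as such.
version (CRuntime_Musl)
{
    // e.g. skip or scale down the stress test on Alpine/musl as well
}
```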
@rainers there ought to be one thread that has the spinlock though, or did it crash?
I don't have the log of the gdb session anymore, and I have failed to reproduce it again so far. If I remember correctly, there was the main thread waiting in
As the process was started from within gdb, I guess it would have triggered on a crash (or doesn't that happen when Errors are handled in a thread?).
@ibuclaw I could reproduce the issue again by raising the number of threads in that test to 1000, while still having to slow down execution with parallel compilation. The situation is as described: no exceptional thread exits are reported, and the spinlock value is 1.
After setting the spinlock value to 0 manually in the debugger and continuing execution, I noticed that a failed assert was reported. So I set a breakpoint at _d_assertp and tried to trigger the failure again for about 50 times and hit it here:

Thread 149 hit Breakpoint 4, _d_assertp (file=0x802d170a0 <_TMP0> "std/experimental/allocator/building_blocks/allocator_list.d", line=500) at src/core/exception.d:798
798 onAssertError(file[0 .. strlen(file)], line);
(gdb) bt
#0 _d_assertp (file=0x802d170a0 <_TMP0> "std/experimental/allocator/building_blocks/allocator_list.d", line=500) at src/core/exception.d:798
#1 0x0000000804ce1e30 in std.experimental.allocator.building_blocks.allocator_list.AllocatorList!(std.experimental.allocator.building_blocks.allocator_list.SharedAllocatorList!(std.experimental.allocator.building_blocks.allocator_list.__unittest_L1331_C9().__lambda_L1340_C26, std.experimental.allocator.mallocator.Mallocator).Factory, std.experimental.allocator.mallocator.Mallocator).AllocatorList.deallocate(void[]) (this=..., b=...) at std/experimental/allocator/building_blocks/allocator_list.d:500
#2 0x0000000804ce0c9d in std.experimental.allocator.building_blocks.allocator_list.SharedAllocatorList!(std.experimental.allocator.building_blocks.allocator_list.SharedAllocatorList!(std.experimental.allocator.building_blocks.allocator_list.__unittest_L1331_C9().__lambda_L1340_C26, std.experimental.allocator.mallocator.Mallocator).Factory, std.experimental.allocator.mallocator.Mallocator).SharedAllocatorList.deallocate(void[]) shared (***@***.***: <unknown type in generated/freebsd/debug/64/unittest/libphobos2-ut.so, CU 0x390ee0d, DIE 0x39bcbba>, b=...) at std/experimental/allocator/building_blocks/allocator_list.d:753
#3 0x0000000804cb8c8b in std.experimental.allocator.building_blocks.allocator_list.__unittest_L1331_C9().fun() (__capture=0x8002fa030) at std/experimental/allocator/building_blocks/allocator_list.d:1353
#4 0x00000008055a5f11 in core.thread.context.Callable.opCall() (this=...) at src/core/thread/context.d:46
#5 0x00000008055a2e11 in core.thread.threadbase.ThreadBase.run() (this=0x80030bc00) at src/core/thread/threadbase.d:440
#6 0x00000008055a4555 in core.thread.osthread.Thread.run() (this=0x80030bc00) at src/core/thread/osthread.d:331
#7 0x00000008055a5677 in thread_entryPoint (arg=0x806a1b630) at src/core/thread/osthread.d:2545

which is https://github.com/dlang/phobos/blob/master/std/experimental/allocator/building_blocks/allocator_list.d#L500

As no unwinding happens on the AssertError, the lock is held by this thread forever. Not sure why no assert message is thrown, but maybe the main thread is expected to handle the error?
So there is no issue with the spinlock itself, but with the allocator_list.
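A minimal sketch (not the Phobos code) of the failure mode described above: `locked` and `withLock` are hypothetical stand-ins for the allocator_list spinlock. The point is that an Error escaping the locked region skips the release, so every later acquirer spins forever, which is exactly the stall the CI is seeing.

```d
// Minimal sketch: `locked` and `withLock` are hypothetical stand-ins,
// not the actual allocator_list internals.
import core.atomic : atomicStore, cas;
import core.thread : Thread;

shared bool locked;   // stand-in for the spinlock the stuck threads wait on

void withLock(void delegate() work)
{
    while (!cas(&locked, false, true)) { }  // spin until the lock is acquired
    // Deliberately no scope(exit)/finally: if `work` throws, the release below
    // is skipped, and for an Error cleanup is not even guaranteed to run.
    work();
    atomicStore(locked, false);
}

void main()
{
    auto t1 = new Thread({
        withLock({ throw new Error("assert failed inside the locked region"); });
    }).start();
    t1.join(false);   // the Error killed only t1; the lock is still held

    auto t2 = new Thread({
        withLock({ /* never reached: withLock spins forever on the lock t1 leaked */ });
    }).start();
    t2.join();        // never returns -- the stall seen on the CI
}
```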
The Error is killing the thread without killing the process by default. I wanted to change this behavior, but it wasn't approved as part of the exception overhaul a few months back (I haven't implemented it yet).
That is a terrible idea. You don't want to bring down the parent just because a worker thread died.
Of course nobody WANTS that. But compared to an Error being silently swallowed by default, I think it's the more desirable option.
That's what C# does, so evidently there are at least a few people out there who think it isn't so terrible. It's probably not a behavior you'd want to retrofit onto an existing language/runtime, though.
You kinda do want it to be the default; this isn't the first time this year it has come up as a pretty hard-to-debug problem. But as Paul said above, retrofitting this isn't all that good a thing. Hence, not approved.
Confirmed that test can fail on all platforms. Either in deallocate
There is a key nuance to C#'s behavior that D does not implement: everything is an exception, including this. As long as you trap and handle the exception, the app won't crash. One code base I work on has had a handler for this exception for 20 years. In fact, most apps will have a default handler for this exception; mostly they just dump it to a log, but they don't crash over it. The reality is that, in practice, we almost never want to abort over a dead thread. So YMMV, but whatever mechanism is chosen, the ability to trap and log+ignore should be readily accessible, because in nearly any case that counts, that's the behavior people are going to use.
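A hedged sketch of the trap-and-log pattern described above, using only the existing core.thread API: Thread.join with rethrow = false returns the Throwable that terminated the worker instead of rethrowing it, so the parent can log and continue. The thread body and the error message below are illustrative, not from this PR.

```d
// Hedged sketch: uses the existing core.thread API; names are illustrative.
import core.thread : Thread;
import std.stdio : stderr;

void main()
{
    auto worker = new Thread({
        throw new Error("worker failed");   // an Error terminates only this thread
    }).start();

    // join(false) waits for the thread and returns the Throwable that terminated
    // it (or null) instead of rethrowing it in the joining thread.
    if (auto t = worker.join(false))
        stderr.writeln("worker died: ", t.msg);   // log + ignore, as described above
}
```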
Just checked my notes; looks like I forgot that the filter got approved.
Do you have an idea how to fix it? Or should we disable the test for all platforms for now? BTW: calling
Not sure where it's going wrong; will try adding locks to the unittest itself later to see if there's any function call that specifically triggers. Though I am more suspicious of
A few notes:
I happen to have reproduced the stall in phobos testing on FreeBSD that is plaguing the CI here and in the dmd repo. The process

generated/freebsd/debug/64/unittest/test_runner std.experimental.allocator.building_blocks.allocator_list

did not terminate while running 87 threads. Unfortunately gdb could not show reasonable callstacks.

I could reproduce it again by running some processes in parallel to slow down the execution of the test, in this case building dmd tests. The test is waiting for threads to join, and all remaining threads show this callstack:
So they are failing to get the spinlock. Let's see whether disabling this test makes the CI more reliable...
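A sketch of what disabling the test could look like (the actual diff in this PR may differ): gate the stress unittest behind a version block so it is compiled out on FreeBSD, where it stalls the CI runners.

```d
// Sketch only; the exact change in this PR's diff may differ.
version (FreeBSD)
{
    // Test skipped: the multi-threaded SharedAllocatorList unittest spuriously
    // stalls on the FreeBSD CI machines (see the discussion above and issue #10730).
}
else
{
    @system unittest
    {
        // ... the original multi-threaded allocator_list stress test ...
    }
}
```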
Just noticed that this has been reported before: #10730 and that LDC disables the test of this module.