disable unittest on FreeBSD that is spuriously stalling the CI pipelines #10876
Conversation
Thanks for your pull request, @rainers!

Bugzilla references: Your PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally: If you don't have a local development environment set up, you can use Digger to test this PR:
dub run digger -- build "master + phobos#10876"
Force-pushed from 076a331 to 752a682.
thewilsonator left a comment:
Otherwise looks good
Force-pushed from 752a682 to 274b8e7.
The failure on Alpine looks pretty similar here. Is that platform detectable, too? The recent macOS timeout failures seem different; those just seem to run on slower machines sometimes.
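For the detectability question: a hedged note, since Alpine itself has no dedicated predefined version identifier in D, but its C runtime (musl) does, so a compile-time check along these lines should work if the test also needs to be relaxed there.

```d
// Hedged sketch: CRuntime_Musl is the predefined version identifier set for
// musl-based platforms such as Alpine; there is no "Alpine" identifier as such.
version (CRuntime_Musl)
{
    // e.g. skip or scale down the stress test on Alpine/musl as well
}
```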
@rainers there ought to be one thread that has the spinlock though, or did it crash?
I don't have the log of the gdb session anymore, and I have failed to reproduce it again so far. If I remember correctly, there was the main thread waiting in
As the process was started from within gdb, I guess it would have triggered on a crash (or doesn't that happen when Errors are handled in a thread?).
@ibuclaw I could reproduce the issue again by raising the number of threads in that test to 1000, while still having to slow down execution with parallel compilation. The situation is as described: no exceptional thread exits are reported, and the spinlock value is 1.
After setting the spinlock value to 0 manually in the debugger and continuing execution, I noticed that a failed assert was reported. So I set a breakpoint at _d_assertp and tried to trigger the failure again for about 50 times and hit it here:

Thread 149 hit Breakpoint 4, _d_assertp (file=0x802d170a0 <_TMP0> "std/experimental/allocator/building_blocks/allocator_list.d", line=500) at src/core/exception.d:798
798 onAssertError(file[0 .. strlen(file)], line);
(gdb) bt
#0 _d_assertp (file=0x802d170a0 <_TMP0> "std/experimental/allocator/building_blocks/allocator_list.d", line=500) at src/core/exception.d:798
#1 0x0000000804ce1e30 in std.experimental.allocator.building_blocks.allocator_list.AllocatorList!(std.experimental.allocator.building_blocks.allocator_list.SharedAllocatorList!(std.experimental.allocator.building_blocks.allocator_list.__unittest_L1331_C9().__lambda_L1340_C26, std.experimental.allocator.mallocator.Mallocator).Factory, std.experimental.allocator.mallocator.Mallocator).AllocatorList.deallocate(void[]) (this=..., b=...) at std/experimental/allocator/building_blocks/allocator_list.d:500
#2 0x0000000804ce0c9d in std.experimental.allocator.building_blocks.allocator_list.SharedAllocatorList!(std.experimental.allocator.building_blocks.allocator_list.SharedAllocatorList!(std.experimental.allocator.building_blocks.allocator_list.__unittest_L1331_C9().__lambda_L1340_C26, std.experimental.allocator.mallocator.Mallocator).Factory, std.experimental.allocator.mallocator.Mallocator).SharedAllocatorList.deallocate(void[]) shared (***@***.***: <unknown type in generated/freebsd/debug/64/unittest/libphobos2-ut.so, CU 0x390ee0d, DIE 0x39bcbba>, b=...) at std/experimental/allocator/building_blocks/allocator_list.d:753
#3 0x0000000804cb8c8b in std.experimental.allocator.building_blocks.allocator_list.__unittest_L1331_C9().fun() (__capture=0x8002fa030) at std/experimental/allocator/building_blocks/allocator_list.d:1353
#4 0x00000008055a5f11 in core.thread.context.Callable.opCall() (this=...) at src/core/thread/context.d:46
#5 0x00000008055a2e11 in core.thread.threadbase.ThreadBase.run() (this=0x80030bc00) at src/core/thread/threadbase.d:440
#6 0x00000008055a4555 in core.thread.osthread.Thread.run() (this=0x80030bc00) at src/core/thread/osthread.d:331
#7 0x00000008055a5677 in thread_entryPoint (arg=0x806a1b630) at src/core/thread/osthread.d:2545

which is https://github.com/dlang/phobos/blob/master/std/experimental/allocator/building_blocks/allocator_list.d#L500

As no unwinding happens on the AssertError, the lock is held by this thread forever. Not sure why no assert message is thrown, but maybe the main thread is expected to handle the error?
So there is no issue with the spinlock itself, but with the allocator_list.
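A minimal sketch (not the Phobos code) of the failure mode described above: `locked` and `withLock` are hypothetical stand-ins for the allocator_list spinlock. The point is that an Error escaping the locked region skips the release, so every later acquirer spins forever, which is exactly the stall the CI is seeing.

```d
// Minimal sketch: `locked` and `withLock` are hypothetical stand-ins,
// not the actual allocator_list internals.
import core.atomic : atomicStore, cas;
import core.thread : Thread;

shared bool locked;   // stand-in for the spinlock the stuck threads wait on

void withLock(void delegate() work)
{
    while (!cas(&locked, false, true)) { }  // spin until the lock is acquired
    // Deliberately no scope(exit)/finally: if `work` throws, the release below
    // is skipped, and for an Error cleanup is not even guaranteed to run.
    work();
    atomicStore(locked, false);
}

void main()
{
    auto t1 = new Thread({
        withLock({ throw new Error("assert failed inside the locked region"); });
    }).start();
    t1.join(false);   // the Error killed only t1; the lock is still held

    auto t2 = new Thread({
        withLock({ /* never reached: withLock spins forever on the lock t1 leaked */ });
    }).start();
    t2.join();        // never returns -- the stall seen on the CI
}
```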
The Error is killing the thread without killing the process by default. I wanted to change this behavior, but it wasn't approved as part of the exception overhaul a few months back (I haven't implemented it yet).
That is a terrible idea. You don't want to bring down the parent just because a worker thread died.
Of course nobody WANTS that. But compared to an Error being silently swallowed by default, I think it's the more desirable option.
That's what C# does, so evidently there are at least a few people out there who think it isn't so terrible. It's probably not a behavior you'd want to retrofit onto an existing language/runtime, though.
You kinda do want it to be the default; this isn't the first time this year it has come up as a pretty hard-to-debug problem. But as Paul said above, retrofitting this isn't all that good a thing. Hence, not approved.
Confirmed that test can fail on all platforms. Either in deallocate
There is a key nuance to C#'s behavior that D does not implement: everything is an exception, including this. As long as you trap and handle the exception, the app won't crash. One code base I work on has had a handler for this exception for 20 years. In fact, most apps will have a default handler for this exception; mostly they just dump it to a log, but they don't crash over it. The reality is that, in practice, we almost never want to abort over a dead thread. So YMMV, but whatever mechanism is chosen, the ability to trap and log+ignore should be readily accessible, because in nearly any case that counts, that's the behavior people are going to use.
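A hedged sketch of the trap-and-log pattern described above, using only the existing core.thread API: Thread.join with rethrow = false returns the Throwable that terminated the worker instead of rethrowing it, so the parent can log and continue. The thread body and the error message below are illustrative, not from this PR.

```d
// Hedged sketch: uses the existing core.thread API; names are illustrative.
import core.thread : Thread;
import std.stdio : stderr;

void main()
{
    auto worker = new Thread({
        throw new Error("worker failed");   // an Error terminates only this thread
    }).start();

    // join(false) waits for the thread and returns the Throwable that terminated
    // it (or null) instead of rethrowing it in the joining thread.
    if (auto t = worker.join(false))
        stderr.writeln("worker died: ", t.msg);   // log + ignore, as described above
}
```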
Just checked my notes; looks like I forgot that the filter got approved.
Do you have an idea how to fix it? Or should we disable the test for all platforms for now? BTW: calling
Not sure where it's going wrong; will try adding locks to the unittest itself later to see if there's any function call that specifically triggers. Though I am more suspicious of
A few notes:
I happen to have reproduced the stall in phobos testing on FreeBSD that is plaguing the CI here and in the dmd repo. The process

generated/freebsd/debug/64/unittest/test_runner std.experimental.allocator.building_blocks.allocator_list

did not terminate while running 87 threads. Unfortunately gdb could not show reasonable callstacks.

I could reproduce it again by running some processes in parallel to slow down the execution of the test, in this case building dmd tests. The test is waiting for threads to join, and all remaining threads show this callstack:
So they are failing to get the spinlock. Let's see whether disabling this test makes the CI more reliable...
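A sketch of what disabling the test could look like (the actual diff in this PR may differ): gate the stress unittest behind a version block so it is compiled out on FreeBSD, where it stalls the CI runners.

```d
// Sketch only; the exact change in this PR's diff may differ.
version (FreeBSD)
{
    // Test skipped: the multi-threaded SharedAllocatorList unittest spuriously
    // stalls on the FreeBSD CI machines (see the discussion above and issue #10730).
}
else
{
    @system unittest
    {
        // ... the original multi-threaded allocator_list stress test ...
    }
}
```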
Just noticed that this has been reported before: #10730 and that LDC disables the test of this module.