Fix crashing core.thread.fiber unittest for AArch64. #4648

JohanEngelen · 2024-05-06T20:41:11Z

Related to #4613

@kinke I think it is better to disable individual unittests like this, than to disable the whole file in the CI script. I spent quite some time on trying to debug this failing test, thinking it was due to musl libc, but instead I found out from #4613 that it is a general AArch64 issue (indeed, also fails on my mac). I would have liked to see that in the source file, rather than hidden in CI configuration. Similar to how we tag known failures of a lit tests inside that actual lit test. This simply ensures that users (and I ;-)) can expect to successfully run the testsuite as normal (without having to know an exclude list).

kinke · 2024-05-06T22:08:22Z

Please note that some of these kludges date back to when druntime was a separate submodule, so that changes weren't this trivial. And it's also not that long ago that we know whether we are compiled with optimizations on (version D_Optimized), which can now be used to skip tests with enabled optimizations only, like this one here, which you'd currently skip for debug tests too (plus not failing consistently, only sporadically IIRC).

Sometimes I have no idea which unittest fails for some module, that was the case here IIRC, no usable info in the CI logs. And then often it's not clear whether it's due to some particular CI config, a specific LLVM version (the math gammafunction unittests on macOS arm64 now being green since LLVM 18) etc.

I'm not a huge fan of disabling genuinely failing tests though - they are real failures and might be encountered in production too, and if there's just a little output line in a wall of log lines, I think it's easy to overlook and just assume that it's all working. Edit: Instead of, in the best case, trying to help to fix the last few failures. :)

JohanEngelen · 2024-05-06T23:18:01Z

It was not meant as a complaint, just trying to improve things / add knowledge of exactly which test is failing. We should upstream this change, such that other druntime authors also see that the test is failing (I think the problem is migrating threads on a platform for which migration is bad).

In any case, it would have already helped me really a lot if I had known that core.thread.fiber is broken for AArch64. We should document that inside the file, rather than in a CI script. Can be done like in this PR, or with a comment + disabling in CI script.

JohanEngelen · 2024-05-06T23:19:54Z

One benefit of this selective test disable (vs disabling whole file in CI script) is that all the other tests in the file are being run and further regression is noticed.

kinke · 2024-05-06T23:36:03Z

We should upstream this change, such that other druntime authors also see that the test is failing (I think the problem is migrating threads on a platform for which migration is bad).

I'm seeing this from the perspective of a package manager / power user building and running the tests on a selected platform. Who's just interested in seeing if all is green (edit: well, and what/how much is failing), not looking for little 'hey, test is disabled' output lines to figure out that something like migrating fibers may not actually work on that platform.

The only reason for these CI exceptions is to be notified on further regressions. Being able to narrow it down to individual tests is / would be nice, but ideally really only for CI, not for package managers etc. looking for proper tests.

And without restricting this to D_Optimized as mentioned above (and ideally LDC for visibility unless really upstreaming this), this wouldn't run this unittest for debug tests anymore, which is currently done (note the $ at the end of the regex, so that only the release tests are excluded).

JohanEngelen · 2024-05-07T16:43:12Z

We should upstream this change, such that other druntime authors also see that the test is failing (I think the problem is migrating threads on a platform for which migration is bad).

I'm seeing this from the perspective of a package manager / power user building and running the tests on a selected platform. Who's just interested in seeing if all is green (edit: well, and what/how much is failing), not looking for little 'hey, test is disabled' output lines to figure out that something like migrating fibers may not actually work on that platform.

This is only hypothetical, no? I don't think there are many (if any) package managers, and they just have to accept that LDC/DMD/.. is half broken and disable the test without knowing what actually is broken. Case in point: at weka we use LDC for AArch64, fibers are used, but not migrated, which I guess for your argument would mean that fibers should have been banned on the data point that the whole of core.thread.fiber is failing. More so: currently aarch64 fibers are not tested in release mode (obviously needed for performance); I was a bit surprised about that actually (my fault). However, there is no choice here: LDC must be used on AArch64, otherwise there is no product.

Nevertheless, there could be better ways of displaying which tests are known to be failing. This is (one of the many) shortcomings of D's built-in unittests.

The only reason for these CI exceptions is to be notified on further regressions. Being able to narrow it down to individual tests is / would be nice, but ideally really only for CI, not for package managers etc. looking for proper tests.

In this case they shouldn't be using D ;-) (there are many tests locally disabled for certain configs)

And without restricting this to D_Optimized as mentioned above (and ideally LDC for visibility unless really upstreaming this), this wouldn't run this unittest for debug tests anymore, which is currently done (note the $ at the end of the regex, so that only the release tests are excluded).

Easy to fix! Will fix it.

kinke · 2024-05-07T16:58:31Z

It all boils down to how severe a failure is/appears. You seem to see this particular unittest failure/crash as a minor issue, something that can be disabled and hidden away in some runtime output (e.g., your line isn't visible in the CI logs; additionally, there's no context, no line number etc.). Finding out about this problem is anything but trivial if one isn't working on druntime and grepping for FIXME etc., wanting to check the AArch64 status on some specific OS.

My gut feeling however is that this can be a severe problem - after all, this test works just fine on x86, but somehow crashes on AArch64 (at least sometimes) when enabling optimizations. Maybe it's just that the test is bad and needs a fixup, maybe it's the fiber code that forgets to save and restore some registers on AArch64, which would be very bad.

JohanEngelen · 2024-05-07T18:00:55Z

It all boils down to how severe a failure is/appears. You seem to see this particular unittest failure/crash as a minor issue, something that can be disabled and hidden away in some runtime output (e.g., your line isn't visible in the CI logs; additionally, there's no context, no line number etc.). Finding out about this problem is anything but trivial if one isn't working on druntime and grepping for FIXME etc., wanting to check the AArch64 status on some specific OS.

You are right that it is not very discoverable, but neither is any of are test results: there are very few people who run testsuites (I never ran a testsuite of a C/C++/Python/... compiler, but use them all the time); and I think there is no-one who checks results of testsuites; I'm assuming everybody just downloads the LDC compiler and that's it.
It took me several hours to discover that this test failure is already known :( so that was a pretty bad experience (thinking it related to musl, e.g. stack size, in some way). A simple comment at start of unittest would at least already help a little, so I'll add that in any case.
Also: the test sporadically fails, so even if someone runs the testsuite (doing it now), it is quite likely it won't be detected either.

My gut feeling however is that this can be a severe problem - after all, this test works just fine on x86, but somehow crashes on AArch64 (at least sometimes) when enabling optimizations. Maybe it's just that the test is bad and needs a fixup, maybe it's the fiber code that forgets to save and restore some registers on AArch64, which would be very bad.

The test is testing migration of fibers, which is explicitly marked as unsafe by druntime and you have to explicitly call allowMigration (whose description screams: don't do this). I think the test is bad, but can't tell exactly what part is bad (apart from migrating and accessing TLS data).

What I'm more worried about is that we are not testing anything related to fibers and AArch64 (unoptimized tests for this stuff is not very relevant, imho). Indeed, if fibers forget to save/restore registers, currently that is not tested at all (!), whereas it could be if only this test is disabled.

kinke · 2024-05-07T18:23:22Z

You are right that it is not very discoverable, but neither is any of are test results: there are very few people who run testsuites (I never ran a testsuite of a C/C++/Python/... compiler, but use them all the time); and I think there is no-one who checks results of testsuites; I'm assuming everybody just downloads the LDC compiler and that's it.

That's fortunately not entirely true: #4613 :)

It took me several hours to discover that this test failure is already known :( so that was a pretty bad experience (thinking it related to musl, e.g. stack size, in some way). A simple comment at start of unittest would at least already help a little, so I'll add that in any case.

I'm sorry, but again, I didn't know that it was this unittest that crashes. And IIRC, it fails pretty often on Linux, but very seldomly on macOS (I've just re-added that exception for GHA after seeing some failures lately). All I knew was that this module sporadically crashes on AArch64 with -O. I agree that a comment/FIXME in the source is nice. As is hopefully clear by now, I'm only worried about discoverability of unclear failures - in my view, there's a very strong distinction between disabling a test for CI ('required' for making it easier to discover further regressions), and disabling a test altogether.

What I'm more worried about is that we are not testing anything related to fibers and AArch64 (unoptimized tests for this stuff is not very relevant, imho). Indeed, if fibers forget to save/restore registers, currently that is not tested at all (!), whereas it could be if only this test is disabled.

Again, I didn't want to set up an emulator or abuse my phone for trying to hunt down the problem and see which unittest fails. And definitely not trying to hunt it down via CI, especially if it isn't consistent.

Now that you figured out which test fails, that's progress and very much appreciated. If only we could restrict this test-exclusion to CI... ;)

dnadlinger · 2024-05-07T18:27:45Z

I just saw this scroll by – what a blast from the past.

I think the test is bad, but can't tell exactly what part is bad (apart from migrating and accessing TLS data).

You are probably aware of this, but the issue is described here: #666 It is target-dependent, as the way TLS variables are referenced (and the way those references are emitted/optimised by the LLVM backend) depends on the target ABI.

dnadlinger · 2024-05-07T18:28:59Z

Corollary: If the TLS caching issue does not explain the test failure (which can be verified by inspection of TLS codegen on AArch64), then the test indeed detects a real bug.

kinke · 2024-05-07T18:33:16Z

Oh hey David! wave - thx for the link; I misremembered this as being an unclear issue that Joakim tried to deal with. If it's really down to TLS and that 'accidentally' working on x86, then I'd have no problem with disabling the test.

dnadlinger · 2024-05-07T18:41:30Z

Yes, but at a glance, it does not seems like that test actually relies on TLS. Thus, allowMigration() should in fact be safe, and it is a genuine issue. (I was just going off Johan's comment earlier, but I am not sure where TLS would actually be used.)

dnadlinger · 2024-05-07T19:16:12Z

There would appear to be a bug in the test code, though: This store
https://github.com/ldc-developers/ldc/blob/40ad5f7583fe90a85fd675bfc0cfc287d565c94a/runtime/druntime/src/core/thread/fiber.d#L2323C21-L2323C29
needs to have release semantics. It does by default on x86_64 as far as the processor is concerned, i.e. unless the compiler were to move it around, but not on AArch64.

Edit: Ah, wait, didn't D specify sequentially consistent ordering for shared accesses by default? That's not it, then.

JohanEngelen · 2024-05-07T19:20:40Z

I can quite easily test this on my mac, it indeed does not crash often but usually does the first time it is compiled with slightly different settings (-O2, -O3). Already tried __gshared on fibs which indeed does not help.

dnadlinger · 2024-05-07T19:21:09Z

fibs is stack memory anyway, though, and not TLS?

JohanEngelen · 2024-05-07T19:24:01Z

fibs is stack memory anyway, though, and not TLS?

I am dumb :( staring at this for too long. Thanks :)

kinke · 2024-05-07T20:34:12Z

Making that an atomicStore explicitly might not hurt. What might be interesting is #666 (comment), preventing optimizations for key fiber functions:

We now have core.volatile, so I wonder if the sm_this accesses could be done through volatileLoad. So far the only workaround that works 3/4s of the time is to disable inlining of switchIn, switchOut, and getThis. Setting optimization level of these (and maybe some others) functions to -O0 has a 100% success rate, but that is a non-standard extension.

JohanEngelen · 2024-05-07T20:36:00Z

There would appear to be a bug in the test code, though: This store https://github.com/ldc-developers/ldc/blob/40ad5f7583fe90a85fd675bfc0cfc287d565c94a/runtime/druntime/src/core/thread/fiber.d#L2323C21-L2323C29 needs to have release semantics. It does by default on x86_64 as far as the processor is concerned, i.e. unless the compiler were to move it around, but not on AArch64.

Edit: Ah, wait, didn't D specify sequentially consistent ordering for shared accesses by default? That's not it, then.

This appears not to be the case:
locks[idx] = false; --> store i8 0, ptr %2, align 1 Just a normal store.
locks[idx].atomicStore(false); --> store atomic i8 0, ptr %2 seq_cst, align 1

However, the latter does not fix the crashes.

JohanEngelen · 2024-05-07T20:48:31Z

Making that an atomicStore explicitly might not hurt. What might be interesting is #666 (comment), preventing optimizations for key fiber functions:

We now have core.volatile, so I wonder if the sm_this accesses could be done through volatileLoad. So far the only workaround that works 3/4s of the time is to disable inlining of switchIn, switchOut, and getThis. Setting optimization level of these (and maybe some others) functions to -O0 has a 100% success rate, but that is a non-standard extension.

switchIn and switchOut already have a pragma(inline, false). I've now also applied it to getThis. This indeed seems to fix the problem entirely on aarch64 alpine linux. Before it was almost 100% crash, now 0%.

kinke · 2024-05-07T20:54:05Z

Oh wow, progress. I was expecting we need @optStrategy("none") according to Iain's comment. Doesn't look like there's much to gain by optimizing these after a quick glance, and there are comments saying this has to be executed in exactly this order etc.

dnadlinger · 2024-05-07T20:56:46Z

It seems like my original workaround, which included a comment (ldc-developers/druntime@db7f8ee), was dropped at some point.

kinke · 2024-05-07T20:59:00Z

Oh, that might likely have been my bad, probably when core.thread was split up into a package, with a bigger refactoring.

dnadlinger · 2024-05-07T21:02:54Z

This appears not to be the case: locks[idx] = false; --> store i8 0, ptr %2, align 1 Just a normal store. locks[idx].atomicStore(false); --> store atomic i8 0, ptr %2 seq_cst, align 1

However, the latter does not fix the crashes.

This might be worth following up on: personally, I think default rw operations to shared data like this are a design mistake, but IIRC the spec did say something about accesses being atomic.

dnadlinger · 2024-05-07T21:08:33Z

Oh, that might likely have been my bad, probably when core.thread was split up into a package, with a bigger refactoring.

Sure, no worries – and in any case, seeing as Johan added it to Fiber.getThis(), not ThreadBase.getThis(), the surrounding code might have changed in a way requiring a careful review anyway.

In either case, an updated comment should probably mention the strategy here: We aim to do the minimum necessary such that druntime task switching itself doesn't break because of TLS caching (as verified by the unit tests), but users are on their own as far as their code is concerned (hence the scary comments).

dnadlinger · 2024-05-07T21:09:24Z

runtime/druntime/src/core/thread/fiber.d

@@ -2320,7 +2323,7 @@ unittest
                        fibs[idx].call();
                        cont |= fibs[idx].state != Fiber.State.TERM;
                    }
-                    locks[idx] = false;
+                    locks[idx].atomicStore(false);


Could make this release for clarity/to maybe trigger more threading bugs.

kinke · 2024-05-10T20:39:35Z

I've restored David's comment and am merging this now, as the now native macOS arm64 job from the GHA-main workflow needs it. [I've left the atomic-store memory order alone, as the prior CAS uses the default too.]

Disable crashing core.thread.fiber unittest for AArch64.

a26e771

Fix core.thread.fiber unittest

7a1c761

JohanEngelen mentioned this pull request May 7, 2024

Add musl libc CI using Alpine Linux container on Ubuntu GHA #4650

Merged

6 tasks

JohanEngelen changed the title ~~Disable crashing core.thread.fiber unittest for AArch64.~~ Fix crashing core.thread.fiber unittest for AArch64. May 7, 2024

dnadlinger reviewed May 7, 2024

View reviewed changes

kinke mentioned this pull request May 10, 2024

GHA: Promote macOS-arm64 cross-compilation job to full native job #4541

Merged

[restore David's original comment, minimally updated]

26d93f7

kinke enabled auto-merge (squash) May 10, 2024 20:36

kinke merged commit fa4f032 into ldc-developers:master May 10, 2024
23 checks passed

JohanEngelen deleted the aarch64_fiber branch May 16, 2024 16:41

JohanEngelen mentioned this pull request May 25, 2024

SEGFAULT in druntime-test-thread/external_threads on aarch64-musl #3403

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix crashing core.thread.fiber unittest for AArch64. #4648

Fix crashing core.thread.fiber unittest for AArch64. #4648

JohanEngelen commented May 6, 2024

kinke commented May 6, 2024 •

edited

Loading

JohanEngelen commented May 6, 2024

JohanEngelen commented May 6, 2024

kinke commented May 6, 2024 •

edited

Loading

JohanEngelen commented May 7, 2024

kinke commented May 7, 2024

JohanEngelen commented May 7, 2024

kinke commented May 7, 2024 •

edited

Loading

dnadlinger commented May 7, 2024

dnadlinger commented May 7, 2024

kinke commented May 7, 2024

dnadlinger commented May 7, 2024

dnadlinger commented May 7, 2024 •

edited

Loading

JohanEngelen commented May 7, 2024

dnadlinger commented May 7, 2024

JohanEngelen commented May 7, 2024

kinke commented May 7, 2024

JohanEngelen commented May 7, 2024

JohanEngelen commented May 7, 2024

kinke commented May 7, 2024

dnadlinger commented May 7, 2024

kinke commented May 7, 2024

dnadlinger commented May 7, 2024

dnadlinger commented May 7, 2024 •

edited

Loading

dnadlinger May 7, 2024

kinke commented May 10, 2024 •

edited

Loading

Fix crashing core.thread.fiber unittest for AArch64. #4648

Fix crashing core.thread.fiber unittest for AArch64. #4648

Conversation

JohanEngelen commented May 6, 2024

kinke commented May 6, 2024 • edited Loading

JohanEngelen commented May 6, 2024

JohanEngelen commented May 6, 2024

kinke commented May 6, 2024 • edited Loading

JohanEngelen commented May 7, 2024

kinke commented May 7, 2024

JohanEngelen commented May 7, 2024

kinke commented May 7, 2024 • edited Loading

dnadlinger commented May 7, 2024

dnadlinger commented May 7, 2024

kinke commented May 7, 2024

dnadlinger commented May 7, 2024

dnadlinger commented May 7, 2024 • edited Loading

JohanEngelen commented May 7, 2024

dnadlinger commented May 7, 2024

JohanEngelen commented May 7, 2024

kinke commented May 7, 2024

JohanEngelen commented May 7, 2024

JohanEngelen commented May 7, 2024

kinke commented May 7, 2024

dnadlinger commented May 7, 2024

kinke commented May 7, 2024

dnadlinger commented May 7, 2024

dnadlinger commented May 7, 2024 • edited Loading

dnadlinger May 7, 2024

Choose a reason for hiding this comment

kinke commented May 10, 2024 • edited Loading

kinke commented May 6, 2024 •

edited

Loading

kinke commented May 6, 2024 •

edited

Loading

kinke commented May 7, 2024 •

edited

Loading

dnadlinger commented May 7, 2024 •

edited

Loading

dnadlinger commented May 7, 2024 •

edited

Loading

kinke commented May 10, 2024 •

edited

Loading