Fix crashing core.thread.fiber unittest for AArch64. #4648
Please note that some of these kludges date back to when druntime was a separate submodule, so that changes weren't this trivial. And it's also not that long ago that we know whether we are compiled with optimizations on (version Sometimes I have no idea which unittest fails for some module, that was the case here IIRC, no usable info in the CI logs. And then often it's not clear whether it's due to some particular CI config, a specific LLVM version (the math gammafunction unittests on macOS arm64 now being green since LLVM 18) etc. I'm not a huge fan of disabling genuinely failing tests though - they are real failures and might be encountered in production too, and if there's just a little output line in a wall of log lines, I think it's easy to overlook and just assume that it's all working. Edit: Instead of, in the best case, trying to help to fix the last few failures. :) |
It was not meant as a complaint, just trying to improve things / add knowledge of exactly which test is failing. We should upstream this change, so that other druntime authors also see that the test is failing (I think the problem is migrating threads on a platform for which migration is bad). In any case, it would have already helped me a lot if I had known that core.thread.fiber is broken for AArch64. We should document that inside the file, rather than in a CI script. Can be done like in this PR, or with a comment + disabling in CI script.
One benefit of selectively disabling this test (vs. disabling the whole file in the CI script) is that all the other tests in the file still run, so any further regressions are noticed.
I'm seeing this from the perspective of a package manager / power user building and running the tests on a selected platform. Who's just interested in seeing if all is green (edit: well, and what/how much is failing), not looking for little 'hey, test is disabled' output lines to figure out that something like migrating fibers may not actually work on that platform. The only reason for these CI exceptions is to be notified on further regressions. Being able to narrow it down to individual tests is / would be nice, but ideally really only for CI, not for package managers etc. looking for proper tests. And without restricting this to |
This is only hypothetical, no? I don't think there are many (if any) package managers, and they just have to accept that LDC/DMD/.. is half broken and disable the test without knowing what actually is broken. Case in point: at weka we use LDC for AArch64, fibers are used, but not migrated, which I guess for your argument would mean that fibers should have been banned on the data point that the whole of core.thread.fiber is failing. More so: currently aarch64 fibers are not tested in release mode (obviously needed for performance); I was a bit surprised about that actually (my fault). However, there is no choice here: LDC must be used on AArch64, otherwise there is no product. Nevertheless, there could be better ways of displaying which tests are known to be failing. This is (one of the many) shortcomings of D's built-in unittests.
In this case they shouldn't be using D ;-) (there are many tests locally disabled for certain configs)
Easy to fix! Will fix it.
It all boils down to how severe a failure is/appears. You seem to see this particular unittest failure/crash as a minor issue, something that can be disabled and hidden away in some runtime output (e.g., your line isn't visible in the CI logs; additionally, there's no context, no line number etc.). Finding out about this problem is anything but trivial if one isn't working on druntime and grepping for FIXME etc., wanting to check the AArch64 status on some specific OS. My gut feeling however is that this can be a severe problem - after all, this test works just fine on x86, but somehow crashes on AArch64 (at least sometimes) when enabling optimizations. Maybe it's just that the test is bad and needs a fixup, maybe it's the fiber code that forgets to save and restore some registers on AArch64, which would be very bad. |
You are right that it is not very discoverable, but neither are any of our test results: there are very few people who run testsuites (I never ran a testsuite of a C/C++/Python/... compiler, but use them all the time); and I think there is no one who checks the results of testsuites; I'm assuming everybody just downloads the LDC compiler and that's it.
The test is testing migration of fibers, which is explicitly marked as unsafe by druntime and you have to explicitly call What I'm more worried about is that we are not testing anything related to fibers on AArch64 (unoptimized tests for this stuff are not very relevant, imho). Indeed, if fibers forget to save/restore registers, currently that is not tested at all (!), whereas it could be if only this test were disabled.
That's fortunately not entirely true: #4613 :)
I'm sorry, but again, I didn't know that it was this unittest that crashes. And IIRC, it fails pretty often on Linux, but very rarely on macOS (I've just re-added that exception for GHA after seeing some failures lately). All I knew was that this module sporadically crashes on AArch64 with
Again, I didn't want to set up an emulator or abuse my phone for trying to hunt down the problem and see which unittest fails. And definitely not trying to hunt it down via CI, especially if it isn't consistent. Now that you figured out which test fails, that's progress and very much appreciated. If only we could restrict this test-exclusion to CI... ;) |
I just saw this scroll by – what a blast from the past.
You are probably aware of this, but the issue is described here: #666 It is target-dependent, as the way TLS variables are referenced (and the way those references are emitted/optimised by the LLVM backend) depends on the target ABI. |
Corollary: If the TLS caching issue does not explain the test failure (which can be verified by inspection of TLS codegen on AArch64), then the test indeed detects a real bug. |
Oh hey David! wave - thx for the link; I misremembered this as being an unclear issue that Joakim tried to deal with. If it's really down to TLS and that 'accidentally' working on x86, then I'd have no problem with disabling the test. |
Yes, but at a glance, it does not seem like that test actually relies on TLS. Thus,
There would appear to be a bug in the test code, though: This store Edit: Ah, wait, didn't D specify sequentially consistent ordering for |
I can quite easily test this on my mac, it indeed does not crash often but usually does the first time it is compiled with slightly different settings (-O2, -O3). Already tried __gshared on |
I am dumb :( staring at this for too long. Thanks :) |
Making that an
This appears not to be the case: However, the latter does not fix the crashes. |
Oh wow, progress. I was expecting we need |
It seems like my original workaround, which included a comment (ldc-developers/druntime@db7f8ee), was dropped at some point. |
Oh, that might likely have been my bad, probably when |
This might be worth following up on: personally, I think default rw operations to shared data like this are a design mistake, but IIRC the spec did say something about accesses being atomic. |
Sure, no worries – and in any case, seeing as Johan added it to In either case, an updated comment should probably mention the strategy here: We aim to do the minimum necessary such that druntime task switching itself doesn't break because of TLS caching (as verified by the unit tests), but users are on their own as far as their code is concerned (hence the scary comments). |
@@ -2320,7 +2323,7 @@ unittest
 fibs[idx].call();
 cont |= fibs[idx].state != Fiber.State.TERM;
 }
-locks[idx] = false;
+locks[idx].atomicStore(false);
Could make this release for clarity/to maybe trigger more threading bugs.
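For illustration, here is a minimal sketch of what the suggested release-ordered store could look like next to a default-ordered CAS. It is not the code from the unittest: the `locks` array size and the helper names `tryLock` and `unlock` are assumptions made up for this example. Only `cas`, `atomicStore` and `MemoryOrder` come from `core.atomic`.

```d
import core.atomic : atomicStore, cas, MemoryOrder;

// Hypothetical lock flags, one per fiber slot (illustrative, not from druntime).
shared bool[4] locks;

/// Tries to take slot `idx`; returns true if the lock was acquired.
bool tryLock(size_t idx)
{
    // CAS with the default (sequentially consistent) memory order.
    return cas(&locks[idx], false, true);
}

/// Releases slot `idx`.
void unlock(size_t idx)
{
    // Explicit release ordering: writes made while holding the lock become
    // visible to whichever thread acquires the lock next.
    atomicStore!(MemoryOrder.rel)(locks[idx], false);
}

void main()
{
    assert(tryLock(0));   // first attempt succeeds
    assert(!tryLock(0));  // slot is already taken
    unlock(0);
    assert(tryLock(0));   // free again after the release store
}
```

Plain `locks[idx].atomicStore(false)`, as in the diff, uses the default `MemoryOrder.seq`, which is at least as strong as release; the explicit form mainly documents intent.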
I've restored David's comment and am merging this now, as the now native macOS arm64 job from the GHA-main workflow needs it. [I've left the atomic-store memory order alone, as the prior CAS uses the default too.] |
Related to #4613
@kinke I think it is better to disable individual unittests like this, than to disable the whole file in the CI script. I spent quite some time trying to debug this failing test, thinking it was due to musl libc, but instead I found out from #4613 that it is a general AArch64 issue (indeed, it also fails on my mac). I would have liked to see that in the source file, rather than hidden in CI configuration. This is similar to how we tag known failures of lit tests inside the lit test itself. This simply ensures that users (and I ;-)) can expect to successfully run the testsuite as normal (without having to know an exclude list).
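As a rough sketch of the idea (not the exact change in this PR), a single unittest can be skipped on one target from inside the source file with a `version` condition, while every other unittest in the module keeps running. The helper name `skipMigrationTest` and the message text are hypothetical.

```d
import core.stdc.stdio : printf;

/// Hypothetical helper: true on targets where the migration test is a known failure.
bool skipMigrationTest()
{
    version (AArch64)
        return true;
    else
        return false;
}

unittest
{
    if (skipMigrationTest())
    {
        // Leave a trace in the test output so the skip is not completely silent.
        printf("core.thread.fiber: migration unittest skipped on AArch64 (known failure)\n");
        return;
    }

    // ... the actual fiber migration test would go here ...
}

void main() {}
```

The trade-off raised in the discussion still applies: the skip is documented next to the test, but someone running the testsuite only sees a line of output rather than a red result.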