-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LLVM backend regression] RuntimeError: unreachable #9443
Comments
A full stack trace (in a build with names, so e.g. |
This is potentially unrelated, however I also received the |
I am experiencing the same problem. I wanted to switch to the new backend but it tells me
|
@davlhd building with Once you find the function this happens in, it might be the same issue @TrevorSundberg said - LLVM will emit |
That did no really help:
|
@davlhd is there a chance the |
With "latest" this error disappeared. It now compiles and runs fine |
It appears that this has reappeared with emscripten 2.0.15. With adding
Downgrading to emscripten 2.0.14 resolves the issue. Also, if I compile with Any clue what could possibly go wrong? |
Hard to say. Quickest diagnosis is probably bisection, https://emscripten.org/docs/contributing/developers_guide.html#bisecting |
I can't do the bisection as my machines are too slow for compiling the LLVM, however, I've managed to create a reproducible case with only open-source components - it's available in this repository. To reproduce the issue:
If you do the same with emscripten 2.0.14, the application will work correctly. Also, if you replace Unlike in my project, adding
I hope this helps you with tracking down this bug. |
@DoDoENT The instructions in that link do not require compiling. It downloads precompiled builds using the emsdk. Sorry if they are not clear enough about that - we should improve them. @sbc100 also has had some recent improvements to the process which we meant to document (the emsdk now has a flag to install a build by the hash, which is slightly easier than editing the text file) but I don't think we have yet. |
You might also want to try running with asan and/or ubsan. |
Count me in. If I compile https://github.com/cloudpilot-emu/cloudpilot with 2.0.15 I get the follow trace at runtime:
The same code runs fine with 2.0.14 and also natively with ASAN and UBSAN enabled. It is worth mentioning that issue only appears if I build with Coincidentally (?) I also get link time errors on 2.0.15 with
I don't have time for bisecting now, but I can do so later this week. |
Thanks for the report @DirtyHairy. The LTO issue does look concerning, and we will probably need to bisect it down. I imagine its an llvm-side change. The |
opened #13736 for the |
Thanks alot @sbc100 ! I'll try to find time to bisect the issue over the next few days |
I have bisected the issue; the problematic commit to Indeed this is a LLVM change; unfortunately, the commit pulls in sizable number of changesets. I hope this helps nevertheless. |
Thanks! Yes, looks like there are 94 LLVM commits there: https://chromium.googlesource.com/emscripten-releases/+/7882b2e11c66da69b832a6ad7baa57c8ccbf9656 That will require someone to bisect locally on LLVM I think, unless one of the commits there looks relevant to the LLVM devs here. |
Well, this might take a few days, but I am not using my Macbook much during the day atm, so I think I can take the ring to mount doom 😏 |
As well as bisecting to find the culprit commit, it might also be useful to reduce the repro case to something smaller that might be give us more clues. |
The affected code is pretty much isolated, so I think this might not be too hard. I'll take a stab at it after bisecting. |
Well, that went down a lot faster than I thought. The faulty commit is: Seems like we indeed need a reduced testcase. I'll see what I can do. |
I've cooked up a reduced testcase. You can find the code here: https://github.com/DirtyHairy/emcc-lto-testcase Simply build with
The code runs fine with 2.0.14 or if From the trace I suspect that the deleted code is in the destructor of the |
Is there anything else that I can do to help addressing this issue? |
@DirtyHairy Please file an LLVM bug (and link to it here). As the commit you reduced to says, it sounds like they expected fallout from it. Attaching your testcase will hopefully be enough for them to investigate and/or decide whether to revert it. |
(Actually maybe it's faster to cc the author of that commit directly?) @preames , the last few comments have a reduced testcase and some investigation of a regression that was bisected down to your commit llvm/llvm-project@8666463 |
Happy to try and help here, but I really don't have much context on emscripten or node. So working from a test case which involves those pieces would be a bit challenging. (For instance, it's not clear to me whether your reduced test case was using ToT clang for the make step, or some other compiler?) If you want to help narrow this down, what would be really helpful is knowing what the IR looks like at the point the nofree transform actually happens. A simple way to get that would be to do a local build of clang w/the transform replaced with a bit of code which simply printed the current module and then crashed. (If it triggers multiple times, you might need to log and continue.) Once we had that, we could manually inspect the IR to see if the transform is legal. If it isn't, we revert my patch. If it is, we then have something to trace back through the print-after-all output to figure out where the wrong inference fact came from. As an aside, I just ran across an issue today (see the llvm-dev thread "Ambiguity in the nofree function attribute") around the semantics of the nofree function attribute. If you were using the allocsize attribute, then maybe that might explain a miscompile. That's really a long shot guess though, nothing more. |
Thanks for the quick response @preames ! Maybe another possibility is to convert the testcase to one that runs in ToT clang? It might be as simple as replacing I tried to do that quickly now, and I'm hitting an unrelated LLVM configuration issue I can't immediately figure out,
But if you have a working local build of LLVM ToT then it might just work for you. |
Thanks for the response, @preames / @kripken ! @preames I am not sure whether I understand top-of-tree correctly. Fwiw, I can reproduce the issue with the latest revision of LLVM main, including clang. I have just checked and cannot reproduce the bug if I build native binaries. However, as even with emscripten the bug only appears if I enable LTO at link I think that the bug may be tied to a transformation that happens during LTO when linking the emscripten libc. It seems that emscripten links to a libc that has been built with LTO enabled if I enable LTO during link? If this is the case, then this might be the decisive difference between emscripten and the native build (I don't think this happens with libc from the XCode SDK). Is there a way to disable this (use -flto objects, but not -flto libc) for testing purposes? @preames If you can give me a few pointers of where to instrument which file I can try to produce the IR dumps you described. |
Yeah, sorry, my bad, I just noticed that the sample above runs fine with
|
I think the simplest testcase here that I can reproduce locally is this one: // emcc -flto -O1
#include <map>
int main() {
std::map<int, int> foo;
foo[0] = 2;
} It does seem like the cause is LTO with libc++, as that is shared in all the testcases. I'd suggest making a native example of this for @preames by making a git repo with the cpp file plus a copy of libc++ from the emscripten tree (it's under |
Sorry, now I don't fully understand: how would I do a native build from this? I suppose I have to build libc++ first? |
Yes, but it's very simple: you will likely need very few of the libc++ In more detail, what you need to do is build with Another option could be to copy code from the libc++ sources into your testcase (edit: and keep copying more until the bug reproduces). |
Thanks 😏 I have put up a version of the testcase with a Makefile that does a native build in my previous testcase repo https://github.com/DirtyHairy/emcc-lto-testcase (I've moved the old testcase to a branch). Unfortunately, I cannot reproduce the crash with my LLVM build (the ToT build from last week that triggers the crash when used with emscripten). Can you spot any error in my build process? To my understanding it should not include the system libc++ headers or link agains the system libc++ |
Given how much this has been reduced, I might be able to make progress with something as simple as a -print-after-all log. If you want to capture that from your reproducer - I can even work with one not from clean upstream - I can see if anything obvious falls out. |
Hi @preames ! I have attached a I can do builds from specific LLVM revisions if that helps you. Thanks again for your support. |
If we have a nobuiltin function, we can't assume we know anything about the implementation. I noticed this when tracing through a log from an in the wild miscompile (emscripten-core/emscripten#9443) triggered after 8666463. We were incorrectly assuming that a custom allocator could not free. (It's not clear yet this is the only problem in said issue.) I also noticed something similiar mentioned in the commit message of ab243e when scrolling back through history. Through, from what I can tell, that commit fixed symptom not root cause. The interface we have for library function detection is extremely error prone, but given the interaction between ``nobuiltin`` decls and ``builtin`` callsites, it's really hard to imagine something much cleaner. I may iterate on that, but it'll be invasive enough I didn't want to hold an obvious functional fix on it.
Digging through the log, I noticed one case where we'd incorrectly annotated a nobuiltin function. This isn't directly related to the triggering change, but it's easy to believe that change exposed the existing issue. There might also be another latent bug here. Can I ask for a rerun with commit 11707435ccb44a9377bfed407453e0646a159636 applied? If it still reproduces, another -print-after-all with the additional flag -print-module-scope added (primarily for the bad run) would be helpful. |
Partial success 😏 With LLVM built from 11707435ccb44a9377bfed407453e0646a159636 the testcase now builds and runs fine with Sorry for the double compression, but the log is too large with gzip, and github will not let me upload bzip2 directly. |
Haven't had a chance to dig through the second set of logs yet, but I think I have a good guess as to the problem. It turns out to be a bug in my original patch w.r.t. interaction between null and the nofree specification. I've got a candidate fix up for review at https://reviews.llvm.org/D100779, and would appreciate knowing if that resolved the remaining problems here. (It may very well be we have a third issue to track down.) |
I have checked, but 2218f5998b5b with the above patch applied still crashes (on the @preames Ping me if you need a more recent log. |
I think I'm going to need a more current log. I've dug through this one, and don't see any direct evidence of the transform triggering or any further obvious miscompiles. Can I ask you to take two additional steps before sending a log? First, can you comment out the problematic transform entirely and just make sure everything passes? I'm getting suspicious that there might be multiple interacting issues here. Second, could you add code to print the function, callsite, and argument when the transform triggers? If needed, I can post a patch that does that. I think I need to see what this is actually triggering on to make progress here. |
Hi @preames ! Sure; however, I think I need a few additional pointers to produce what you want. Do you have a specific transform in mind for me to disable, or do you want to go through them until we have identified the problematic one(s)? Where in LLVM should I look for the place to disable them? |
This change effectively reverts 8666463, but since there have been some changes on top and I wanted to leave the tests in, it's not a mechanical revert. Why revert this now? Two main reasons: 1) There are continuing discussion around what the semantics of nofree. I am getting increasing uncomfortable with the seeming possibility we might redefine nofree in a way incompatible with these changes. 2) There was a reported miscompile triggered by this change (emscripten-core/emscripten#9443). At first, I was making good progress on tracking down the issues exposed and those issues appeared to be unrelated latent bugs. Now that we've found at least one bug in the original change, and the investigation has stalled, I'm no longer comfortable leaving this in tree. In retrospect, I probably should have reverted this earlier and investigated the issues once the triggering change was out of tree.
I have gone ahead and reverted the triggering change in commit 15e19a25 since we don't appear to be making rapid progress any more. At the current moment, I don't have any evidence the (fixed) change was actually buggy any more, but I also am out of time to chase down the remaining miscompile. I may return to this in the future, but wanted to not leave things broken in the meantime. |
Thanks alot @preames . To be 100% sure I just did a test build; the testcases run fine again. Sorry we couldn't drill down into the underlying issue. If want to revisit this at some later point feel free to ping me. |
If we have a nobuiltin function, we can't assume we know anything about the implementation. I noticed this when tracing through a log from an in the wild miscompile (emscripten-core/emscripten#9443) triggered after 8666463. We were incorrectly assuming that a custom allocator could not free. (It's not clear yet this is the only problem in said issue.) I also noticed something similiar mentioned in the commit message of ab243e when scrolling back through history. Through, from what I can tell, that commit fixed symptom not root cause. The interface we have for library function detection is extremely error prone, but given the interaction between ``nobuiltin`` decls and ``builtin`` callsites, it's really hard to imagine something much cleaner. I may iterate on that, but it'll be invasive enough I didn't want to hold an obvious functional fix on it.
This change effectively reverts 8666463, but since there have been some changes on top and I wanted to leave the tests in, it's not a mechanical revert. Why revert this now? Two main reasons: 1) There are continuing discussion around what the semantics of nofree. I am getting increasing uncomfortable with the seeming possibility we might redefine nofree in a way incompatible with these changes. 2) There was a reported miscompile triggered by this change (emscripten-core/emscripten#9443). At first, I was making good progress on tracking down the issues exposed and those issues appeared to be unrelated latent bugs. Now that we've found at least one bug in the original change, and the investigation has stalled, I'm no longer comfortable leaving this in tree. In retrospect, I probably should have reverted this earlier and investigated the issues once the triggering change was out of tree.
Hi, I have the same problem here: There is any solution? |
My stacktrace:
|
I tried version 3.0.0 and the problem still :( |
I ended up solving the problem in my project. It turned out that two different C files were defining slightly different versions of a particular function (one was void while the other returned an int status code). The original fastcomp implementation apparently worked around the problem, while the LLVM backend for whatever reason handles it less gracefully. This may or may not have been the case at the time I opened the issue, but I do see now that there's a Would it make sense for cases like this to be treated as errors rather than warnings? (Although I do have @kripken feel free to close this ticket unless it's still helpful to keep open. |
@buu700 Thanks for the update. Interesting, sounds like this might be undefined behavior then, which the LLVM backend optimizes in more dangerous ways than fastcomp did. Yes, let's close this as there isn't anything specific to fix at this time. If we can get specific reproducers for the other problems mentioned throughout this issue let's file those separately. About |
Regarding wasm-ld warnings, yes you can do |
If we have a nobuiltin function, we can't assume we know anything about the implementation. I noticed this when tracing through a log from an in the wild miscompile (emscripten-core/emscripten#9443) triggered after 8666463. We were incorrectly assuming that a custom allocator could not free. (It's not clear yet this is the only problem in said issue.) I also noticed something similiar mentioned in the commit message of ab243e when scrolling back through history. Through, from what I can tell, that commit fixed symptom not root cause. The interface we have for library function detection is extremely error prone, but given the interaction between ``nobuiltin`` decls and ``builtin`` callsites, it's really hard to imagine something much cleaner. I may iterate on that, but it'll be invasive enough I didn't want to hold an obvious functional fix on it.
This change effectively reverts 86664638, but since there have been some changes on top and I wanted to leave the tests in, it's not a mechanical revert. Why revert this now? Two main reasons: 1) There are continuing discussion around what the semantics of nofree. I am getting increasing uncomfortable with the seeming possibility we might redefine nofree in a way incompatible with these changes. 2) There was a reported miscompile triggered by this change (emscripten-core/emscripten#9443). At first, I was making good progress on tracking down the issues exposed and those issues appeared to be unrelated latent bugs. Now that we've found at least one bug in the original change, and the investigation has stalled, I'm no longer comfortable leaving this in tree. In retrospect, I probably should have reverted this earlier and investigated the issues once the triggering change was out of tree.
I just had to rebuild sidh.js with latest-fastcomp to work around an error caused by latest-upstream.
With the following test case:
The make succeeds, but then I get the following unexpected run-time error:
This problem goes away after running
emsdk install latest-fastcomp ; emsdk activate latest-fastcomp ; make
.The text was updated successfully, but these errors were encountered: