Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LLVM backend regression] RuntimeError: unreachable #9443

Closed
buu700 opened this issue Sep 17, 2019 · 55 comments
Closed

[LLVM backend regression] RuntimeError: unreachable #9443

buu700 opened this issue Sep 17, 2019 · 55 comments

Comments

@buu700
Copy link
Contributor

buu700 commented Sep 17, 2019

I just had to rebuild sidh.js with latest-fastcomp to work around an error caused by latest-upstream.

With the following test case:

git clone https://github.com/cyph/sidh.js
cd sidh.js
make
node -e "require('./dist/sidh').keyPair().then(kp => console.log(kp))"

The make succeeds, but then I get the following unexpected run-time error:

RuntimeError: unreachable
RuntimeError: unreachable

/cyph/dist/sidh.js:1
var sidh=function(){var A,I,g,B,C={};function Q(A,I){if(0===A)return I;throw new Error("SIDH error: "+A)}function i(A,I){return new Uint8Array(new Uint8Array(C.HEAPU8.buffer,A,I))}function E(A){try{C._free(A)}catch(A){setTimeout((function(){throw A}),0)}}C.ready=new Promise((function(A,I){(B={}).onAbort=I,B.onRuntimeInitialized=function(){try{B._sidhjs_public_key_bytes(),A(B)}catch(A){I(A)}};var g,B=void 0!==B?B:{},C={};for(g in B)B.hasOwnProperty(g)&&(C[g]=B[g]);var Q=[],i=!1,E=!1,o=!1,a=!1,G=!1;i="object"==typeof window,E="function"==typeof importScripts,a="object"==typeof process&&"object"==typeof process.versions&&"string"==typeof process.versions.node,o=a&&!i&&!E,G=!i&&!o&&!E;var n,r,c,t,D="";function e(A){return B.locateFile?B.locateFile(A,D):D+A}o?(D=__dirname+"/",n=function(A,I){var g;return(g=YA(A))||(c||(c=eval("require")("fs")),t||(t=eval("require")("path")),A=t.normalize(A),g=c.readFileSync(A)),I?g:g.toString()},r=function(A){var I=n(A,!0);return I.buffer||(I=new Uint8Array(
abort(RuntimeError: unreachable). Build with -s ASSERTIONS=1 for more info.

This problem goes away after running emsdk install latest-fastcomp ; emsdk activate latest-fastcomp ; make.

@kripken
Copy link
Member

kripken commented Sep 17, 2019

A full stack trace (in a build with names, so e.g. --profiling) might help. Also good would be to bisect to find where this broke.

@TrevorSundberg
Copy link

TrevorSundberg commented Feb 15, 2020

This is potentially unrelated, however I also received the RuntimeError: unreachable and it turned out it was because a missing return statement when the function returned a bool. Hope this helps some poor soul in the future. Also don't ignore warnings!

@davlhd
Copy link

davlhd commented Mar 3, 2020

I am experiencing the same problem. I wanted to switch to the new backend but it tells me

011245ce-331:8 Uncaught (in promise) RuntimeError: unreachable.
at wasm-function[331]:0x13538
at wasm-function[11511]:0x2356e3
at n.dynCall_iiiiiii (https://xxx35:156437)
at invoke_iiiiiii (https://xxs:35:102325)
at wasm-function[14939]:0x3ae4f3
at wasm-function[11508]:0x235501
at n.dynCall_iiiiiiiii (https://xxxs:35:156969)
at invoke_iiiiiiiii (https://xxx:35:102612)
at wasm-function[14865]:0x3ab14e

@kripken
Copy link
Member

kripken commented Mar 4, 2020

@davlhd building with --profiling will make the stack trace have function names.

Once you find the function this happens in, it might be the same issue @TrevorSundberg said - LLVM will emit unreachables in places it thinks control flow can't reach, so if you forget a return, stuff like that.

@davlhd
Copy link

davlhd commented Jun 4, 2020

That did no really help:

00ceb43a:formatted:403 Uncaught (in promise) RuntimeError: unreachable
    at <anonymous>:wasm-function[285]:0xf851
    at dynCall_iiiiiii (<anonymous>:wasm-function[7519]:0x170aa2)
    at i.dynCall_iiiiiii (https://xxxx:35:140187)
    at invoke_iiiiiii (https://xxxx35:116645)
    at <anonymous>:wasm-function[9936]:0x2be4fd
    at dynCall_iiiiiiiii (<anonymous>:wasm-function[7517]:0x170a78)
    at i.dynCall_iiiiiiiii (https://xxxx:35:140719)
    at invoke_iiiiiiiii (https://xxxx:35:116932)
    at <anonymous>:wasm-function[9087]:0x24ae07
    at dynCall_iiiiii (<anonymous>:wasm-function[7520]:0x170ab4)

@kripken
Copy link
Member

kripken commented Jun 4, 2020

@davlhd is there a chance the --profiling flag was not applied at the link stage? Or that that's an old build? (It would be a very serious unknown bug if that didn't work.)

@davlhd
Copy link

davlhd commented Jun 24, 2020

With "latest" this error disappeared. It now compiles and runs fine

@DoDoENT
Copy link

DoDoENT commented Mar 19, 2021

It appears that this has reappeared with emscripten 2.0.15. With adding --profiling to the link flags, I got the following stack trace in my project (the stack trace points to the Normalize function within google test:

SweatShopTest.wasm:0x187f9 Uncaught RuntimeError: unreachable
    at testing::internal::FilePath::Normalize() (http://localhost:6931/SweatShopTest.wasm:wasm-function[683]:0x187f9)
    at testing::internal::FilePath::FilePath(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char> > const&) (http://localhost:6931/SweatShopTest.wasm:wasm-function[86]:0x25b3)
    at testing::internal::FilePath::GetCurrentDir() (http://localhost:6931/SweatShopTest.wasm:wasm-function[684]:0x18843)
    at testing::internal::UnitTestImpl::AddTestInfo(void (*)(), void (*)(), testing::TestInfo*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[619]:0x1653f)
    at testing::internal::MakeAndRegisterTestInfo(char const*, char const*, char const*, char const*, testing::internal::CodeLocation, void const*, void (*)(), void (*)(), testing::internal::TestFactoryBase*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[103]:0x2cad)
    at __cxx_global_var_init (http://localhost:6931/SweatShopTest.wasm:wasm-function[1158]:0x2b58f)
    at __wasm_call_ctors (http://localhost:6931/SweatShopTest.wasm:wasm-function[1117]:0x29973)
    at Module.___wasm_call_ctors (http://localhost:6931/SweatShopTest.js:3868:82)
    at func (http://localhost:6931/SweatShopTest.js:378:3)
    at callRuntimeCallbacks (http://localhost:6931/SweatShopTest.js:622:4)

Downgrading to emscripten 2.0.14 resolves the issue. Also, if I compile with -g (to emit debug symbols) or with threads enabled, the problem also goes away 🤔.

Any clue what could possibly go wrong?

@kripken
Copy link
Member

kripken commented Mar 19, 2021

Hard to say. Quickest diagnosis is probably bisection, https://emscripten.org/docs/contributing/developers_guide.html#bisecting

@DoDoENT
Copy link

DoDoENT commented Mar 20, 2021

I can't do the bisection as my machines are too slow for compiling the LLVM, however, I've managed to create a reproducible case with only open-source components - it's available in this repository.

To reproduce the issue:

  • clone this repository
  • initialize all its submodules with git submodule update --init
  • initialize your shell with emscripten 2.0.15
  • run emcmake cmake -GNinja /path/to/the/repository
  • build with ninja
  • run emrun --browser chrome ./SweatShop.html
  • application crashes with Uncaught RuntimeError: unreachable

If you do the same with emscripten 2.0.14, the application will work correctly.

Also, if you replace -flto with -flto=thin in the CMakeLists.txt, the application will also work with 2.0.15 (but I need the full LTO for my project).

Unlike in my project, adding -g here does not make the issue go away. Instead, it prints a nice stack trace:

Uncaught RuntimeError: unreachable
    at operator delete(void*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[83]:0x46d1)
    at std::__2::_DeallocateCaller::__do_call(void*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[140]:0x5565)
    at std::__2::_DeallocateCaller::__do_deallocate_handle_size(void*, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[138]:0x5516)
    at std::__2::_DeallocateCaller::__do_deallocate_handle_size_align(void*, unsigned long, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[135]:0x5497)
    at std::__2::__libcpp_deallocate(void*, unsigned long, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[134]:0x542f)
    at std::__2::allocator<char>::deallocate(char*, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[1167]:0x15b56)
    at std::__2::allocator_traits<std::__2::allocator<char> >::deallocate(std::__2::allocator<char>&, char*, unsigned long) (http://localhost:6931/SweatShopTest.wasm:wasm-function[1165]:0x15aea)
    at std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char> >::~basic_string() (http://localhost:6931/SweatShopTest.wasm:wasm-function[72]:0x44c0)
    at testing::internal::CodeLocation::~CodeLocation() (http://localhost:6931/SweatShopTest.wasm:wasm-function[71]:0x4498)
    at testing::internal::MakeAndRegisterTestInfo(char const*, char const*, char const*, char const*, testing::internal::CodeLocation, void const*, void (*)(), void (*)(), testing::internal::TestFactoryBase*) (http://localhost:6931/SweatShopTest.wasm:wasm-function[70]:0x4442)

I hope this helps you with tracking down this bug.

@kripken
Copy link
Member

kripken commented Mar 20, 2021

@DoDoENT The instructions in that link do not require compiling. It downloads precompiled builds using the emsdk.

Sorry if they are not clear enough about that - we should improve them. @sbc100 also has had some recent improvements to the process which we meant to document (the emsdk now has a flag to install a build by the hash, which is slightly easier than editing the text file) but I don't think we have yet.

@sbc100
Copy link
Collaborator

sbc100 commented Mar 22, 2021

You might also want to try running with asan and/or ubsan.

@DirtyHairy
Copy link

DirtyHairy commented Mar 22, 2021

Count me in. If I compile https://github.com/cloudpilot-emu/cloudpilot with 2.0.15 I get the follow trace at runtime:

    at std::__2::__tree<std::__2::__value_type<ChunkType, Chunk>, std::__2::__map_value_compare<ChunkType, std::__2::__value_type<ChunkType, Chunk>, std::__2::less<ChunkType>, true>, std::__2::allocator<std::__2::__value_type<ChunkType, Chunk> > >::destroy(std::__2::__tree_node<std::__2::__value_type<ChunkType, Chunk>, void*>*) (:4200/cloudpilot_web.wasm:wasm-function[2176]:0x611d1)
    at std::__2::__tree<std::__2::__value_type<ChunkType, Chunk>, std::__2::__map_value_compare<ChunkType, std::__2::__value_type<ChunkType, Chunk>, std::__2::less<ChunkType>, true>, std::__2::allocator<std::__2::__value_type<ChunkType, Chunk> > >::destroy(std::__2::__tree_node<std::__2::__value_type<ChunkType, Chunk>, void*>*) (:4200/cloudpilot_web.wasm:wasm-function[2176]:0x611c6)
    at emscripten_bind_Cloudpilot_LoadState_2 (:4200/cloudpilot_web.wasm:wasm-function[2365]:0x75cbc)
    at push.O5fm.Module._emscripten_bind_Cloudpilot_LoadState_2 (cloudpilot_web.js:4654)
    at Cloudpilot.push.O5fm.Cloudpilot.LoadState.Cloudpilot.LoadState (cloudpilot_web.js:5394)
    at Cloudpilot.loadState (Cloudpilot.ts:265)

The same code runs fine with 2.0.14 and also natively with ASAN and UBSAN enabled. It is worth mentioning that issue only appears if I build with -flto, without LTO the code runs fine. Seems like LTO erroneously removes code generated for std::map and replaces it with unreachable. Unfortunately the readme is not up to date, but I can provide instructions for reproducing the issue.

Coincidentally (?) I also get link time errors on 2.0.15 with LLD_REPORT_UNDEFINED:

wasm-ld: error: symbol exported via --export not found: __start_em_asm
wasm-ld: error: symbol exported via --export not found: __stop_em_asm

I don't have time for bisecting now, but I can do so later this week.

@sbc100
Copy link
Collaborator

sbc100 commented Mar 22, 2021

Thanks for the report @DirtyHairy. The LTO issue does look concerning, and we will probably need to bisect it down. I imagine its an llvm-side change.

The symbol exported via --export not found: __start_em_asm issue with LLD_REPORT_UNDEFINED I am aware of and thinking about how to work around that.

@sbc100
Copy link
Collaborator

sbc100 commented Mar 22, 2021

opened #13736 for the LLD_REPORT_UNDEFINED issue.

@DirtyHairy
Copy link

Thanks alot @sbc100 ! I'll try to find time to bisect the issue over the next few days

@DirtyHairy
Copy link

I have bisected the issue; the problematic commit to emscripten-releases is 7882b2e11c66da69b832a6ad7baa57c8ccbf9656

Indeed this is a LLVM change; unfortunately, the commit pulls in sizable number of changesets. I hope this helps nevertheless.

@kripken
Copy link
Member

kripken commented Mar 29, 2021

Thanks! Yes, looks like there are 94 LLVM commits there:

https://chromium.googlesource.com/emscripten-releases/+/7882b2e11c66da69b832a6ad7baa57c8ccbf9656

That will require someone to bisect locally on LLVM I think, unless one of the commits there looks relevant to the LLVM devs here.

@DirtyHairy
Copy link

DirtyHairy commented Mar 29, 2021

Well, this might take a few days, but I am not using my Macbook much during the day atm, so I think I can take the ring to mount doom 😏

@sbc100
Copy link
Collaborator

sbc100 commented Mar 29, 2021

As well as bisecting to find the culprit commit, it might also be useful to reduce the repro case to something smaller that might be give us more clues.

@DirtyHairy
Copy link

As well as bisecting to find the culprit commit, it might also be useful to reduce the repro case to something smaller that might be give us more clues.

The affected code is pretty much isolated, so I think this might not be too hard. I'll take a stab at it after bisecting.

@DirtyHairy
Copy link

Well, that went down a lot faster than I thought. The faulty commit is:

llvm/llvm-project@8666463

Seems like we indeed need a reduced testcase. I'll see what I can do.

@DirtyHairy
Copy link

DirtyHairy commented Apr 2, 2021

I've cooked up a reduced testcase. You can find the code here: https://github.com/DirtyHairy/emcc-lto-testcase

Simply build with make (with 2.0.15) and run node testcase.js in order to trigger the bug:

pestix@bladehenge-2:~/git/emcc-lto-testcase$ node testcase.js
exception thrown: RuntimeError: unreachable,RuntimeError: unreachable
    at std::__2::__tree<std::__2::__value_type<ChunkType, ChunkProbe>, std::__2::__map_value_compare<ChunkType, std::__2::__value_type<ChunkType, ChunkProbe>, std::__2::less<ChunkType>, true>, std::__2::allocator<std::__2::__value_type<ChunkType, ChunkProbe> > >::destroy(std::__2::__tree_node<std::__2::__value_type<ChunkType, ChunkProbe>, void*>*) (<anonymous>:wasm-function[10]:0x8df)
    at __original_main (<anonymous>:wasm-function[8]:0x864)
    at main (<anonymous>:wasm-function[20]:0x278b)
    at Module._main (/Users/pestix/git/emcc-lto-testcase/testcase.js:1555:60)
    at callMain (/Users/pestix/git/emcc-lto-testcase/testcase.js:1615:15)
    at doRun (/Users/pestix/git/emcc-lto-testcase/testcase.js:1675:23)
    at run (/Users/pestix/git/emcc-lto-testcase/testcase.js:1690:5)
    at runCaller (/Users/pestix/git/emcc-lto-testcase/testcase.js:1602:19)
    at removeRunDependency (/Users/pestix/git/emcc-lto-testcase/testcase.js:1225:7)
    at receiveInstance (/Users/pestix/git/emcc-lto-testcase/testcase.js:1365:5)

The code runs fine with 2.0.14 or if -flto is omitted during link.

From the trace I suspect that the deleted code is in the destructor of the std::map<ChunkType, ChunkProbe> allocated in SavestateProbe.cxx (an instance of which is allocated on the stack in Savestate::AllocateBuffer.

@DirtyHairy
Copy link

Is there anything else that I can do to help addressing this issue?

@kripken
Copy link
Member

kripken commented Apr 9, 2021

@DirtyHairy Please file an LLVM bug (and link to it here). As the commit you reduced to says, it sounds like they expected fallout from it. Attaching your testcase will hopefully be enough for them to investigate and/or decide whether to revert it.

@kripken
Copy link
Member

kripken commented Apr 9, 2021

(Actually maybe it's faster to cc the author of that commit directly?)

@preames , the last few comments have a reduced testcase and some investigation of a regression that was bisected down to your commit llvm/llvm-project@8666463

@preames
Copy link

preames commented Apr 9, 2021

Happy to try and help here, but I really don't have much context on emscripten or node. So working from a test case which involves those pieces would be a bit challenging. (For instance, it's not clear to me whether your reduced test case was using ToT clang for the make step, or some other compiler?)

If you want to help narrow this down, what would be really helpful is knowing what the IR looks like at the point the nofree transform actually happens. A simple way to get that would be to do a local build of clang w/the transform replaced with a bit of code which simply printed the current module and then crashed. (If it triggers multiple times, you might need to log and continue.)

Once we had that, we could manually inspect the IR to see if the transform is legal. If it isn't, we revert my patch. If it is, we then have something to trace back through the print-after-all output to figure out where the wrong inference fact came from.

As an aside, I just ran across an issue today (see the llvm-dev thread "Ambiguity in the nofree function attribute") around the semantics of the nofree function attribute. If you were using the allocsize attribute, then maybe that might explain a miscompile. That's really a long shot guess though, nothing more.

@kripken
Copy link
Member

kripken commented Apr 9, 2021

Thanks for the quick response @preames !

Maybe another possibility is to convert the testcase to one that runs in ToT clang? It might be as simple as replacing emcc with ToT clang, and removing the .js suffix from the output.

I tried to do that quickly now, and I'm hitting an unrelated LLVM configuration issue I can't immediately figure out,

/usr/bin/ld: ../lib/LLVMgold.so: error loading plugin: ../lib/LLVMgold.so: cannot open shared object file: No such file or directory

But if you have a working local build of LLVM ToT then it might just work for you.

@DirtyHairy
Copy link

Thanks for the response, @preames / @kripken !

@preames I am not sure whether I understand top-of-tree correctly. Fwiw, I can reproduce the issue with the latest revision of LLVM main, including clang.

I have just checked and cannot reproduce the bug if I build native binaries. However, as even with emscripten the bug only appears if I enable LTO at link I think that the bug may be tied to a transformation that happens during LTO when linking the emscripten libc.

It seems that emscripten links to a libc that has been built with LTO enabled if I enable LTO during link? If this is the case, then this might be the decisive difference between emscripten and the native build (I don't think this happens with libc from the XCode SDK). Is there a way to disable this (use -flto objects, but not -flto libc) for testing purposes?

@preames If you can give me a few pointers of where to instrument which file I can try to produce the IR dumps you described.

@DirtyHairy
Copy link

DirtyHairy commented Apr 15, 2021

Yeah, sorry, my bad, I just noticed that the sample above runs fine with -O2 and -O3 (and crashes only with -O1). It seems that the allocation and deallocation is optimized out. The following crashes at all levels > 0:

#include <iostream>

using namespace std;

struct Foo {
    virtual ~Foo() { cout << "destructor called" << endl << flush; }
};

int main() {
    auto foo = new Foo();

    cout << &foo << endl << flush;

    delete foo;
}

@kripken
Copy link
Member

kripken commented Apr 15, 2021

I think the simplest testcase here that I can reproduce locally is this one:

// emcc -flto -O1
#include <map>
int main() {
  std::map<int, int> foo;
  foo[0] = 2;
}

It does seem like the cause is LTO with libc++, as that is shared in all the testcases.

I'd suggest making a native example of this for @preames by making a git repo with the cpp file plus a copy of libc++ from the emscripten tree (it's under system/lib/libcxx). Then all the libc++ inputs will be "in userspace", like for emcc, and no system c++ libraries will be linked. I'd expect the bug to reproduce that way.

@DirtyHairy
Copy link

Sorry, now I don't fully understand: how would I do a native build from this? I suppose I have to build libc++ first?

@kripken
Copy link
Member

kripken commented Apr 15, 2021

Yes, but it's very simple: you will likely need very few of the libc++ .cpp files (since it's almost all in headers). You may even be able to get away with not running any configure/cmake step, and just adding them.

In more detail, what you need to do is build with -nostdinc++, so that the system C++ standard library is not used, and add an -I for the place with the copy of libc++ so that it is used. And find the right .cpp files from that copy of libc++ and just include them as more inputs, alongside your .cpp file. That way all the C++ standard library support arrives "from userspace", and that means the compiler can LTO it. (Also, doing it this way will ensure you have the same C++ standard library as emcc does.)

Another option could be to copy code from the libc++ sources into your testcase (edit: and keep copying more until the bug reproduces).

@DirtyHairy
Copy link

DirtyHairy commented Apr 15, 2021

Thanks 😏 I have put up a version of the testcase with a Makefile that does a native build in my previous testcase repo https://github.com/DirtyHairy/emcc-lto-testcase (I've moved the old testcase to a branch). Unfortunately, I cannot reproduce the crash with my LLVM build (the ToT build from last week that triggers the crash when used with emscripten).

Can you spot any error in my build process? To my understanding it should not include the system libc++ headers or link agains the system libc++

@preames
Copy link

preames commented Apr 16, 2021

Given how much this has been reduced, I might be able to make progress with something as simple as a -print-after-all log. If you want to capture that from your reproducer - I can even work with one not from clean upstream - I can see if anything obvious falls out.

@DirtyHairy
Copy link

Hi @preames !

I have attached a -print-after-all log of the failing std::map testcase (ir.bad.log) produced with emscripten 2.0.16 and a reference log(ir.good.log) with emscripten 2.0.14.

I can do builds from specific LLVM revisions if that helps you.

Thanks again for your support.

ir.bad.log
ir.good.log

preames added a commit to llvm/llvm-project that referenced this issue Apr 16, 2021
If we have a nobuiltin function, we can't assume we know anything about the implementation.

I noticed this when tracing through a log from an in the wild miscompile (emscripten-core/emscripten#9443) triggered after 8666463.  We were incorrectly assuming that a custom allocator could not free.  (It's not clear yet this is the only problem in said issue.)

I also noticed something similiar mentioned in the commit message of ab243e when scrolling back through history.  Through, from what I can tell, that commit fixed symptom not root cause.

The interface we have for library function detection is extremely error prone, but given the interaction between ``nobuiltin`` decls and ``builtin`` callsites, it's really hard to imagine something much cleaner.  I may iterate on that, but it'll be invasive enough I didn't want to hold an obvious functional fix on it.
@preames
Copy link

preames commented Apr 17, 2021

Digging through the log, I noticed one case where we'd incorrectly annotated a nobuiltin function. This isn't directly related to the triggering change, but it's easy to believe that change exposed the existing issue. There might also be another latent bug here.

Can I ask for a rerun with commit 11707435ccb44a9377bfed407453e0646a159636 applied? If it still reproduces, another -print-after-all with the additional flag -print-module-scope added (primarily for the bad run) would be helpful.

@DirtyHairy
Copy link

Partial success 😏 With LLVM built from 11707435ccb44a9377bfed407453e0646a159636 the testcase now builds and runs fine with -O1. However, it still crashes at -O2 and -O3. I have attached two logs at -O2, one with -print-module-scope, and one without it. Beware though, the -print-module-scope one decompresses to ~800MB.

Sorry for the double compression, but the log is too large with gzip, and github will not let me upload bzip2 directly.

ir.bad.log.bz2.gz
ir.bad.print-module-scope.log.bz2.gz

@preames
Copy link

preames commented Apr 19, 2021

Haven't had a chance to dig through the second set of logs yet, but I think I have a good guess as to the problem. It turns out to be a bug in my original patch w.r.t. interaction between null and the nofree specification. I've got a candidate fix up for review at https://reviews.llvm.org/D100779, and would appreciate knowing if that resolved the remaining problems here. (It may very well be we have a third issue to track down.)

@DirtyHairy
Copy link

DirtyHairy commented Apr 20, 2021

I have checked, but 2218f5998b5b with the above patch applied still crashes (on the std::map). Fwiw there is some variance in behaviour between emscripten (with LLVM built from source) 2.0.16 and 2.0.14 --- with 2.0.16, the std::map testcase dies for all optimization levels while it runs fine with -O1. With the virtual destructor testcase, the crash reproduces with -O1 on 2.0.16 (and LLVM from source) --- 2.0.14 works fine, as does 2.0.16 with -O2 and -O3 😛

@preames Ping me if you need a more recent log.

@preames
Copy link

preames commented Apr 21, 2021

I think I'm going to need a more current log. I've dug through this one, and don't see any direct evidence of the transform triggering or any further obvious miscompiles.

Can I ask you to take two additional steps before sending a log? First, can you comment out the problematic transform entirely and just make sure everything passes? I'm getting suspicious that there might be multiple interacting issues here. Second, could you add code to print the function, callsite, and argument when the transform triggers? If needed, I can post a patch that does that. I think I need to see what this is actually triggering on to make progress here.

@DirtyHairy
Copy link

Hi @preames !

Sure; however, I think I need a few additional pointers to produce what you want. Do you have a specific transform in mind for me to disable, or do you want to go through them until we have identified the problematic one(s)? Where in LLVM should I look for the place to disable them?

preames added a commit to llvm/llvm-project that referenced this issue Apr 22, 2021
This change effectively reverts 8666463, but since there have been some changes on top and I wanted to leave the tests in, it's not a mechanical revert.

Why revert this now?  Two main reasons:
1) There are continuing discussion around what the semantics of nofree.  I am getting increasing uncomfortable with the seeming possibility we might redefine nofree in a way incompatible with these changes.
2) There was a reported miscompile triggered by this change (emscripten-core/emscripten#9443).  At first, I was making good progress on tracking down the issues exposed and those issues appeared to be unrelated latent bugs.  Now that we've found at least one bug in the original change, and the investigation has stalled, I'm no longer comfortable leaving this in tree.  In retrospect, I probably should have reverted this earlier and investigated the issues once the triggering change was out of tree.
@preames
Copy link

preames commented Apr 22, 2021

I have gone ahead and reverted the triggering change in commit 15e19a25 since we don't appear to be making rapid progress any more. At the current moment, I don't have any evidence the (fixed) change was actually buggy any more, but I also am out of time to chase down the remaining miscompile. I may return to this in the future, but wanted to not leave things broken in the meantime.

@DirtyHairy
Copy link

Thanks alot @preames . To be 100% sure I just did a test build; the testcases run fine again. Sorry we couldn't drill down into the underlying issue. If want to revisit this at some later point feel free to ping me.

arichardson pushed a commit to arichardson/llvm-project that referenced this issue Sep 10, 2021
If we have a nobuiltin function, we can't assume we know anything about the implementation.

I noticed this when tracing through a log from an in the wild miscompile (emscripten-core/emscripten#9443) triggered after 8666463.  We were incorrectly assuming that a custom allocator could not free.  (It's not clear yet this is the only problem in said issue.)

I also noticed something similiar mentioned in the commit message of ab243e when scrolling back through history.  Through, from what I can tell, that commit fixed symptom not root cause.

The interface we have for library function detection is extremely error prone, but given the interaction between ``nobuiltin`` decls and ``builtin`` callsites, it's really hard to imagine something much cleaner.  I may iterate on that, but it'll be invasive enough I didn't want to hold an obvious functional fix on it.
arichardson pushed a commit to arichardson/llvm-project that referenced this issue Sep 10, 2021
This change effectively reverts 8666463, but since there have been some changes on top and I wanted to leave the tests in, it's not a mechanical revert.

Why revert this now?  Two main reasons:
1) There are continuing discussion around what the semantics of nofree.  I am getting increasing uncomfortable with the seeming possibility we might redefine nofree in a way incompatible with these changes.
2) There was a reported miscompile triggered by this change (emscripten-core/emscripten#9443).  At first, I was making good progress on tracking down the issues exposed and those issues appeared to be unrelated latent bugs.  Now that we've found at least one bug in the original change, and the investigation has stalled, I'm no longer comfortable leaving this in tree.  In retrospect, I probably should have reverted this earlier and investigated the issues once the triggering change was out of tree.
@paulocoutinhox
Copy link

Hi,

I have the same problem here:
paulocoutinhox/pdfium-lib#54

There is any solution?

@paulocoutinhox
Copy link

My stacktrace:

index.js:77 Uncaught RuntimeError: unreachable
    at pdfium::base::AllocPages(void*, unsigned long, unsigned long, pdfium::base::PageAccessibilityConfiguration, pdfium::base::PageTag, bool) (index.wasm:0x71085)
    at pdfium::base::internal::PartitionBucket::SlowPathAlloc(pdfium::base::internal::PartitionRootBase*, int, unsigned long, bool*) (index.wasm:0x72900)
    at fxcrt::StringDataTemplate<char>::Create(unsigned long) (index.wasm:0x5628)
    at fxcrt::StringDataTemplate<char>::Create(char const*, unsigned long) (index.wasm:0x5695)
    at fxcrt::ByteString::ByteString(char const*) (index.wasm:0x886e)
    at CLinuxPlatform::CreateDefaultSystemFontInfo() (index.wasm:0xf501)
    at CFX_GEModule::Create(char const**) (index.wasm:0x13666)
    at FPDF_InitLibraryWithConfig (index.wasm:0x70a9e)
    at main (index.wasm:0x367b)
    at index.js:1591
$pdfium::base::AllocPages(void*, unsigned long, unsigned long, pdfium::base::PageAccessibilityConfiguration, pdfium::base::PageTag, bool) @ index.wasm:0x71085
$pdfium::base::internal::PartitionBucket::SlowPathAlloc(pdfium::base::internal::PartitionRootBase*, int, unsigned long, bool*) @ index.wasm:0x72900
$fxcrt::StringDataTemplate<char>::Create(unsigned long) @ index.wasm:0x5628
$fxcrt::StringDataTemplate<char>::Create(char const*, unsigned long) @ index.wasm:0x5695
$fxcrt::ByteString::ByteString(char const*) @ index.wasm:0x886e
$CLinuxPlatform::CreateDefaultSystemFontInfo() @ index.wasm:0xf501
$CFX_GEModule::Create(char const**) @ index.wasm:0x13666
$FPDF_InitLibraryWithConfig @ index.wasm:0x70a9e
$main @ index.wasm:0x367b
(anonymous) @ index.js:1591
callMain @ index.js:5933
doRun @ index.js:5990
(anonymous) @ index.js:6001
setTimeout (async)
run @ index.js:5997
runCaller @ index.js:5911
removeRunDependency @ index.js:1520
receiveInstance @ index.js:1680
receiveInstantiationResult @ index.js:1697
Promise.then (async)
(anonymous) @ index.js:1726
Promise.then (async)
instantiateAsync @ index.js:1723
createWasm @ index.js:1754
(anonymous) @ index.js:5465

@paulocoutinhox
Copy link

I tried version 3.0.0 and the problem still :(

@buu700
Copy link
Contributor Author

buu700 commented Jul 11, 2022

I ended up solving the problem in my project. It turned out that two different C files were defining slightly different versions of a particular function (one was void while the other returned an int status code).

The original fastcomp implementation apparently worked around the problem, while the LLVM backend for whatever reason handles it less gracefully. This may or may not have been the case at the time I opened the issue, but I do see now that there's a wasm-ld: warning: function signature mismatch message that gets printed when the problem is present.

Would it make sense for cases like this to be treated as errors rather than warnings? (Although I do have ERROR_ON_UNDEFINED_SYMBOLS enabled, which may or may not affect the behavior here.)

@kripken feel free to close this ticket unless it's still helpful to keep open.

@kripken
Copy link
Member

kripken commented Jul 12, 2022

@buu700 Thanks for the update. Interesting, sounds like this might be undefined behavior then, which the LLVM backend optimizes in more dangerous ways than fastcomp did.

Yes, let's close this as there isn't anything specific to fix at this time. If we can get specific reproducers for the other problems mentioned throughout this issue let's file those separately.

About wasm-ld: warning: function signature mismatch, I hope that would be an error if you link with -Werror to turn warnings to errors. @sbc100 does that flag apply to wasm-ld warnings too?

@kripken kripken closed this as completed Jul 12, 2022
@sbc100
Copy link
Collaborator

sbc100 commented Jul 12, 2022

Regarding wasm-ld warnings, yes you can do -Wl,--fatal-warnings.

mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
If we have a nobuiltin function, we can't assume we know anything about the implementation.

I noticed this when tracing through a log from an in the wild miscompile (emscripten-core/emscripten#9443) triggered after 8666463.  We were incorrectly assuming that a custom allocator could not free.  (It's not clear yet this is the only problem in said issue.)

I also noticed something similiar mentioned in the commit message of ab243e when scrolling back through history.  Through, from what I can tell, that commit fixed symptom not root cause.

The interface we have for library function detection is extremely error prone, but given the interaction between ``nobuiltin`` decls and ``builtin`` callsites, it's really hard to imagine something much cleaner.  I may iterate on that, but it'll be invasive enough I didn't want to hold an obvious functional fix on it.
mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
This change effectively reverts 86664638, but since there have been some changes on top and I wanted to leave the tests in, it's not a mechanical revert.

Why revert this now?  Two main reasons:
1) There are continuing discussion around what the semantics of nofree.  I am getting increasing uncomfortable with the seeming possibility we might redefine nofree in a way incompatible with these changes.
2) There was a reported miscompile triggered by this change (emscripten-core/emscripten#9443).  At first, I was making good progress on tracking down the issues exposed and those issues appeared to be unrelated latent bugs.  Now that we've found at least one bug in the original change, and the investigation has stalled, I'm no longer comfortable leaving this in tree.  In retrospect, I probably should have reverted this earlier and investigated the issues once the triggering change was out of tree.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants