-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New invalid argument error in libzmq, mutex.hpp:142 #2991
Comments
It's fixed in the latest master |
@nightlark did you have any chance to try the latest master? |
I tried the latest commit that was made to master yesterday, and it still had the issue. I'm rerunning the build with the commit made today. |
Today's changes won't affect that. Could you try with this commit? 9bd2d3f |
I'm running a build with that commit now. Pretty much what I found yesterday from the merged PR commits is: 2960 - works |
@bluca You probably assume that this is related to the mutex in random.cpp, right? Maybe that isn't the case at all. It would be good to have some more information, i.e. the stack traces of all threads when the assertion occurs. |
Commit 9bd2d3f works. I need to figure out how to get a stack trace out of travis. |
Okay, then it probably is related to the destruction order somehow. I had a look at where context_t is used in your code, but did not find something suspicous after giving it a quick look. Maybe you can reproduce it locally within gdb? |
I'll see if I can update my Linux vm and get a stack trace later today or tomorrow. |
I was able to reproduce it locally within gdb. Program received signal SIGABRT, Aborted. Thread 2 (Thread 0x2aaaad12c700 (LWP 158139)):
Thread 1 (Thread 0x2aaaaab19940 (LWP 158132)):
|
What's zmqContextManager ? |
It's a singleton wrapper around a pointer to the zmq context. Maybe there's some other things that should be getting done before the zmq context is released? https://github.com/GMLC-TDC/HELICS-src/blob/master/src/helics/common/zmqContextManager.cpp |
Is this cleanup done by a function explicitly registered by atexit? From the stack trace, it appears so (__run_exit_handlers). This cleanup function may only be registered after the first zmq context is created, since this triggers a library initialization, and the cleanup must be done in the reverse order. |
@sigiesec I think this fix is causing more troubles than the original bug it fixed :-( that's on me, as I started this whole global init/deinit rigamarole. I think we should stop doing this with libsodium. The API does say it's not thread safe, but nobody ever reported an issue with it in all these years. The original issue is that with Tweetnacl there was a file descriptor leak: #2632 which is nasty as at some point it causes long running processes which constantly create&destroy contexts (an annoying anti-pattern, but folks use it) to crash when the file descriptor limit is reached. We don't recommend to use Tweetnacl in production anyway - so anybody who faces this initialisation problem can just build with libsodium instead (they should anyway). What do you think? |
@bluca I am not completely sure what you are exactly referring to by "this fix". Do you suggest to revert #2636? If that's the case, I tend to agree :) This is still not a clean AND easy-to-use solution would be to provide zmq_global_init and zmq_global_cleanup functions, which must be explicitly called by the application and by default initialize/cleanup libsodium, but can be configured (at compile time or via a parameter at runtime) not to initialize/cleanup libsodium implicitly. This would need to be used, if the application, or another library libfoo used by the application, also depends on libsodium and uses the same instance of the library. The same would need to be done in libfoo, and only the application would call libsodiums init and cleanup functions. This is a general problem, when any library with multiple incoming dependencies requires initialization and cleanup. But as you said, since no one ever reported an issue with this in years, this probably has no priority. This is probably because an application usually initializes a single zmq context, or at least the first zmq context is initialized from the main thread, and probably does not use libsodium other than through libzmq. Another "solution" might be to completely forbid static builds of libzmq, but this would break many existing uses (including ours). |
Not completely reverting - just removing I really don't like the idea of adding more mandatory APIs - it makes using the library more difficult and convoluted. It's also a backward-incompatible change so we can't do it anyway :-/ |
Ok, but you refer to part of the changes from #2636. If you create a PR, I will give it another look to be sure we talk about the same thing. Hm yes, maybe it could be done in a way that is not mandatory. rabbitmq-c also has a similar problem with its use of openssl, but the solution there is also not completely clean. It allows to optionally explicitly initialize openssl, and in that case, it skips its later implicit initialization of openssl. libprotobuf has an optional cleanup function. If it is not explicitly called, some tools might report memory leaks. But I am not sure if these cases are comparable. |
Solution: restrict it only to the original issue zeromq#2632, Tweetnacl on *NIX when using /dev/urandom, ie: without the new Linux getrandom() syscall. Existing applications might use atexit to register cleanup functions (like CZMQ does), and the current change as-is imposes an ordering that did not exist before - the context MUST be created BEFORE registering the cleanup with atexit. This is a backward incompatible change that is reported to cause aborts in some applications. Although libsodium's documentation says that its initialisation APIs is not thread-safe, nobody has ever reported an issue with it, so avoiding the global init/deinit in the libsodium case is the less risky option we have. Tweetnacl users on Windows and on Linux with getrandom (glibc 2.25 and Linux kernel 3.17) are not affected by the original issue. Fixes zeromq#2991
Solution: restrict it only to the original issue zeromq#2632, Tweetnacl on *NIX when using /dev/urandom, ie: without the new Linux getrandom() syscall. Existing applications might use atexit to register cleanup functions (like CZMQ does), and the current change as-is imposes an ordering that did not exist before - the context MUST be created BEFORE registering the cleanup with atexit. This is a backward incompatible change that is reported to cause aborts in some applications. Although libsodium's documentation says that its initialisation APIs is not thread-safe, nobody has ever reported an issue with it, so avoiding the global init/deinit in the libsodium case is the less risky option we have. Tweetnacl users on Windows and on Linux with getrandom (glibc 2.25 and Linux kernel 3.17) are not affected by the original issue. Fixes zeromq#2991
Solution: restrict it only to the original issue zeromq#2632, Tweetnacl on *NIX when using /dev/urandom, ie: without the new Linux getrandom() syscall. Existing applications might use atexit to register cleanup functions (like CZMQ does), and the current change as-is imposes an ordering that did not exist before - the context MUST be created BEFORE registering the cleanup with atexit. This is a backward incompatible change that is reported to cause aborts in some applications. Although libsodium's documentation says that its initialisation APIs is not thread-safe, nobody has ever reported an issue with it, so avoiding the global init/deinit in the libsodium case is the less risky option we have. Tweetnacl users on Windows and on Linux with getrandom (glibc 2.25 and Linux kernel 3.17) are not affected by the original issue. Fixes zeromq#2991
After trying with the changes in #3001, I'm still seeing the same mutex error. Is there a CMake option that can be used to tell it not to use the global random init/deinit? Something that achieves the same effect as overriding the value of ZMQ_HAVE_THREADSAFE_STATIC_LOCAL_INIT might work. |
There is no option and you shouldn't define anything new - are you using libsodium or tweetnacl? |
The build for the zmq library says it is using tweetnacl, but I don't think we are using any encryption in our code. |
Switch to libsodium - tweetnacl shouldn't be used anywhere but in development |
Thanks! libsodium doesn't have the mutex problem -- since we aren't using tweetnacl or libsodium, would ENABLE_CURVE=OFF disable both of them? |
Yes that's correct |
Hello Invalid argument (src/mutex.hpp:142) |
It cannot be fixed - if your linux distro doesn't have getrandom(), either use libsodium (should do that anyway if running in production) or disable curve entirely |
Thank you for the clear answer! |
I came across this error recently again — apparently the copy of zmq distributed by homebrew on macOS uses tweetnacl. So macOS users may want to compile zmq themselves. |
That's not good. This is the formula: https://github.com/Homebrew/homebrew-core/blob/master/Formula/zeromq.rb are you able to test a change and send them a PR? I don't have OSX available |
I think I can. Should the default package be getting compiled with libsodium (or curve disabled)? |
I think defaulting to libsodium is a good idea, but not sure what the policy guidelines for homebrew are |
This avoids the issue described in zeromq/libzmq#2991 and zeromq/czmq#1299
This is needed for libzmq 4.2.5, so we use libsodium and avoid issues zeromq/libzmq#2991 and zeromq/czmq#1299
This is needed for libzmq 4.2.5, so we use libsodium and avoid issues zeromq/libzmq#2991 and zeromq/czmq#1299
Using TweetNaCl can result in shutdown errors from libzmq, and it is not recommended for "production" use. libsodium is the default recommended by the libzmq maintainers as per the discussion in: zeromq/libzmq#2991 (comment)
Using TweetNaCl can result in shutdown errors from libzmq, and it is not recommended for "production" use. libsodium is the default recommended by the libzmq maintainers as per the discussion in: zeromq/libzmq#2991 (comment) Closes #67039. Signed-off-by: Sean Molenaar <1484494+SMillerDev@users.noreply.github.com> Signed-off-by: BrewTestBot <1589480+BrewTestBot@users.noreply.github.com>
The zeromq bottle on Homebrew now uses libsodium by default. |
I'm still seeing this problem with 4.3.4. |
When rebuilding with libsodium ( |
Issue description
One of the unit test suites we have finishes all of the test cases (which involve starting up and shutting down zmq, probably at least 50 times), but then there is an invalid argument error by zmq after boost reports the number of test cases that have passed. The first time observing this issue was on March 9th, but we cache builds of dependencies. The cached version on our develop branch was last updated February 27. Switching our builds to use the last tagged release (v4.2.3) makes the error go away.
The output for the zmq error: Invalid argument (/home/travis/build/GMLC-TDC/HELICS-src/libzmq/src/mutex.hpp:142)
Environment
Travis CI Linux VMs using gcc-6, gcc-4.9, and clang-3.6
Minimal test code / Steps to reproduce the issue
What's the actual result? (include assertion message & call stack if applicable)
https://travis-ci.org/GMLC-TDC/HELICS-src/jobs/353188544 - see error at the bottom
What's the expected result?
https://travis-ci.org/GMLC-TDC/HELICS-src/jobs/353393510
The text was updated successfully, but these errors were encountered: