deadlock between zmq_poll, zmq_send using inproc PAIR sockets #2759
Comments
Do you happen to share the sockets between threads? |
Thanks for the suggestions — see below:
On Sep 29, 2017, at 4:26 AM, Asmod4n wrote:
Do you happen to share the sockets between threads?
No
Could you give a code example to reproduce it?
Not yet. I was hoping that someone who knows the code would be able to tell from the stack trace what the problem is.
|
Why do you think there is a deadlock? Poll's timeout is -1, it could just wait for something to arrive. |
As mentioned in the original post, one thread is waiting on zmq_poll, another is attempting to zmq_send to one of the sockets being polled, and is also blocked in the send.
That is a classic deadlock.
|
To expand on earlier email (now that I’ve had my coffee ;-)
Here’s the poll thread:
```
Thread 5 (Thread 0x7feea7608700 (LWP 19298)):
#0 0x00000033bfedf383 in poll () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007feeaa7ce83e in zmq::socket_poller_t::wait (this=0x7feea76077f0, events_=0xd52660, n_events_=3, timeout_=-1)
at /home/btorpey/work/libzmq/master/src/socket_poller.cpp:447
rc = 1
timeout = -1
found = 0
clock = {last_tsc = 833960319375223, last_time = 269616557}
now = 0
end = 0
first_pass = false
#2 0x00007feeaa7cc376 in zmq_poller_wait_all (poller_=0x7feea76077f0, events=0xd52660, n_events=3, timeout_=-1)
at /home/btorpey/work/libzmq/master/src/zmq.cpp:1371
rc = 0
#3 0x00007feeaa7ccbda in zmq_poller_poll (items_=0x7feea76078c0, nitems_=3, timeout_=-1)
at /home/btorpey/work/libzmq/master/src/zmq.cpp:813
poller = {tag = 3405691582, signaler = 0x0, items = std::vector of length 3, capacity 4 = {{socket = 0xcaf940, fd = 0,
user_data = 0x0, events = 1, pollfd_index = -1}, {socket = 0xcb2d40, fd = 0, user_data = 0x0, events = 1,
pollfd_index = -1}, {socket = 0xcd57f0, fd = 0, user_data = 0x0, events = 1, pollfd_index = -1}},
need_rebuild = false, use_signaler = false, poll_size = 3, pollfds = 0x1192eb0}
repeat_items = false
j_start = 32750
found_events = 32750
rc = 0
events = 0xd52660
#4 0x00007feeaa7cbd09 in zmq_poll (items_=0x7feea76078c0, nitems_=3, timeout_=-1)
at /home/btorpey/work/libzmq/master/src/zmq.cpp:861
No locals.
#5 0x00007feeaaa284d4 in zmqBridgeMamaTransportImpl_dispatchThread (closure=0xcaa4b0)
at /home/btorpey/work/OpenMAMA-zmq/nyfix/src/transport.c:1118
size = -1
items = {{socket = 0xcaf940, fd = 0, events = 1, revents = 0}, {socket = 0xcb2d40, fd = 0, events = 1, revents = 0}, {
socket = 0xcd57f0, fd = 0, events = 1, revents = 0}}
rc = 1
impl = 0xcaa4b0
zmsg = {
_ = "\000\000\000\000\000\000\000\000`&\325\000\000\000\000\000\340d\271G\374\177\000\000\005\235\242\252\356\177\000\000\070\033\322\000\000\000\000\000 \000e\000\000\000\000\000 *\325", '\000' <repeats 12 times>}
#6 0x00000033c0607aa1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7 0x00000033bfee8bcd in clone () from /lib64/libc.so.6
No symbol table info available.
```
And here’s the send thread:
```
Thread 1 (Thread 0x7feeb26fdc60 (LWP 19291)):
#0 0x00000033bfedf383 in poll () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007feeaa7a5127 in zmq::signaler_t::wait (this=0x1197b48, timeout_=-1)
at /home/btorpey/work/libzmq/master/src/signaler.cpp:233
pfd = {fd = 38, events = 1, revents = 0}
rc = 0
#2 0x00007feeaa7797bb in zmq::mailbox_t::recv (this=0x1197ae0, cmd_=0x7ffc47b96280, timeout_=-1)
at /home/btorpey/work/libzmq/master/src/mailbox.cpp:81
rc = -1
ok = false
#3 0x00007feeaa7aaaee in zmq::socket_base_t::process_commands (this=0x11975f0, timeout_=-1, throttle_=false)
at /home/btorpey/work/libzmq/master/src/socket_base.cpp:1335
rc = 0
cmd = {destination = 0x1193400, type = zmq::command_t::pipe_term_ack, args = {stop = {<No data fields>},
plug = {<No data fields>}, own = {object = 0x7feea7607510}, attach = {engine = 0x7feea7607510}, bind = {pipe =
0x7feea7607510}, activate_read = {<No data fields>}, activate_write = {msgs_read = 140662987060496}, hiccup = {
pipe = 0x7feea7607510}, pipe_term = {<No data fields>}, pipe_term_ack = {<No data fields>}, pipe_hwm = {
inhwm = -1486850800, outhwm = 32750}, term_req = {object = 0x7feea7607510}, term = {linger = -1486850800},
term_ack = {<No data fields>}, reap = {socket = 0x7feea7607510}, reaped = {<No data fields>},
done = {<No data fields>}}}
#4 0x00007feeaa7aa2e4 in zmq::socket_base_t::send (this=0x11975f0, msg_=0x7ffc47b963f0, flags_=0)
at /home/btorpey/work/libzmq/master/src/socket_base.cpp:1148
sync_lock = {mutex = 0x0}
rc = -1
timeout = -1
end = 0
#5 0x00007feeaa7cab76 in s_sendmsg (s_=0x11975f0, msg_=0x7ffc47b963f0, flags_=0)
at /home/btorpey/work/libzmq/master/src/zmq.cpp:375
sz = 257
rc = 0
max_msgsz = 140663039244940
#6 0x00007feeaa7cacbd in zmq_send (s_=0x11975f0, buf_=0x7ffc47b964f0, len_=257, flags_=0)
at /home/btorpey/work/libzmq/master/src/zmq.cpp:409
msg = {
_ = "\000\000\000\000\000\000\000\000\240q\031\001\000\000\000\000Pd\271G\374\177\000\000\005\235\242\252\356\177\000\000\370\313\323\000\000\000\000\000\360uf\000\000\000\000\000\360u\031\001", '\000' <repeats 11 times>}
__PRETTY_FUNCTION__ = "int zmq_send(void*, const void*, size_t, int)"
s = 0x11975f0
rc = 0
#7 0x00007feeaaa2abd9 in zmqBridgeMamaTransportImpl_sendCommand (impl=0xcaa4b0, msg=0x7ffc47b964f0, msgSize=257)
at /home/btorpey/work/OpenMAMA-zmq/nyfix/src/transport.c:1656
status = MAMA_STATUS_OK
temp = 0x11975f0
__FUNCTION__ = "zmqBridgeMamaTransportImpl_sendCommand"
rc = 0
i = 0
#8 0x00007feeaaa256eb in zmqBridgeMamaSubscriptionImpl_subscribe (transport=0xcaa4b0, topic=0xcf6890 "MXLIB/BU")
at /home/btorpey/work/OpenMAMA-zmq/nyfix/src/subscription.c:417
s = MAMA_STATUS_OK
msg = {command = 83 'S',
arg1 = "MXLIB/BU\000\000\000\000\000\000\000\t\000\000\000\000\000\000\000С\312\000\000\000\000\000\020\071\323", '\000' <repeats 21 times>, "\350\303\347\277\063\000\000\000\220h\317\000\000\000\000\000\322\016\350\277\063\000\000\000\020\071\323\000\000\000\000\000\370\313\323\000\000\000\000\000\260e\271G\374\177\000\000g\000\273\262\356\177\000\000\020\071\323", '\000' <repeats 21 times>, "\020/\031\001\000\000\000\000\220h\317\000\000\000\000\000\235\370\272\262\356\177\000\000\036\000\000\000\000\000\000\000\060\267\312\000\000\000\000\000\200\273\312\000\000\000\000\000\060\267\312\000\000\000\000\000\000f\271G\374\177\000\000l\263"...}
__FUNCTION__ = "zmqBridgeMamaSubscriptionImpl_subscribe"
#9 0x00007feeaaa253f6 in zmqBridgeMamaSubscriptionImpl_create (impl=0xd527e0, source=0x1192600 "MXLIB/BU", symbol=0x0)
at /home/btorpey/work/OpenMAMA-zmq/nyfix/src/subscription.c:336
s = MAMA_STATUS_OK
__FUNCTION__ = "zmqBridgeMamaSubscriptionImpl_create"
#10 0x00007feeaaa24b56 in zmqBridgeMamaSubscription_create (subscriber=0xd52f20, source=0x1192600 "MXLIB/BU", symbol=0x0,
tport=0xcaa1d0, queue=0xcf8990, callback=..., subscription=0xd52f10, closure=0xd62f90)
at /home/btorpey/work/OpenMAMA-zmq/nyfix/src/subscription.c:78
s = MAMA_STATUS_OK
__FUNCTION__ = "zmqBridgeMamaSubscription_create"
impl = 0xd527e0
...
```
This sure looks like a deadlock. Note also that the deadlock only occurs with PAIR sockets — with PUB/SUB sockets, there is no deadlock, but messages are intermittently lost.
Any help would be welcome — including sample code that is known to work. As per previous, the code on the send side looks like this:
```
void* temp = zmq_socket(impl->mZmqContext, ZMQ_PUB);
zmq_connect(temp, ZMQ_CONTROL_ENDPOINT);
zmq_pollitem_t pollitems [] = { { temp, 0, ZMQ_POLLIN, 0 } };
zmq_poll(pollitems, 1, 1); // see #2267
zmq_send(temp, msg, msgSize, 0);
zmq_close(temp);
```
|
I've created a sample program that demonstrates the problem, using "pure" 0MQ code. The repo is at https://github.com/WallStProg/zmqtests.git. Please see the README.md etc. in the threads directory. As mentioned above, PAIR sockets deadlock, PUB sockets don't, but both silently drop (lots of) messages. I imagine that I'm doing something wrong here, but for the life of me I can't see what it is. Any help would be appreciated! |
As the documentation says, sockets must not be used from multiple threads. |
Which socket are you referring to? The controlReceiver is being used from a single thread, the controlSender is created and destroyed from the sending threads each time. As far as I can tell, there is no sharing of sockets between threads. That is in fact the reason for using the controlSub -- to be able to execute commands on the dataSub socket from a single thread. If you can be more specific about what you think is wrong, that would be very helpful. |
Unless I'm mistaken, sockets are created and bound in the main thread: https://github.com/WallStProg/zmqtests/blob/master/threads/threads.cpp#L197 And then used from another thread: https://github.com/WallStProg/zmqtests/blob/master/threads/threads.cpp#L141 Also, creating and immediately deleting a socket is an anti-pattern: https://github.com/WallStProg/zmqtests/blob/master/threads/threads.cpp#L62 It's better to make them long-lived and reuse them. |
Especially with ZMQ_PAIR sockets, it's better to only bind and connect once. |
Thanks for the replies! Unfortunately, still having difficulties, so please bear with me.
I moved the socket create and bind to mainLoop, so there is no possibility of those sockets being shared between threads (see repo). That didn't make any difference.
Well, that's what the example code shows (Code Connected, Signaling Between Threads) -- granted this is a somewhat different use-case. If that is not going to work, the only other thing I can think of is to create a sending socket in each thread (in thread-local storage), and connect that to the main receiver socket that is being polled.
If each thread needs its own sending socket, then PAIR sockets don't work, since there can only be a single concurrent connection between sender and receiver (unless I'm reading the docs wrong). So, it sounds like I would need to:
A couple of issues that come to mind:
Just a bit of background here that may make things clearer, and/or suggest other possible solutions/work-arounds:
Again, many thanks for any hints, tips, suggestions, etc.! |
I've put together an exhaustive survey of the inter-thread communications methods here: https://github.com/WallStProg/zmqtests/tree/master/threads. This includes a sample of using PAIR sockets for inter-thread communication that will deadlock every single time. As far as I can tell, the documentation in http://zguide.zeromq.org/page:all#Signaling-Between-Threads-PAIR-Sockets is misleading -- while the PAIR example will work in the simple sample provided, in more "real-world" scenarios it is prone to deadlock. Comments, criticisms, suggestions are welcome! |
That feels a bit hard to believe - zactors are behind half the APIs of CZMQ, and they use PAIR sockets just fine. In real-world weeks-long running scenarios as well. |
That's what I thought too, and I've been tearing my hair out trying to figure it out. Which is why I put the code up -- it's quite possible I'm doing something wrong, and if that's the case I'd love to know what it is. I believe the code is pretty much self-explanatory, but let me know if you have any questions. I suspect the problem may be that I'm creating short-lived connections from multiple threads -- the typical use-case I see is having a single long-lived connection between the two ends of the PAIR. I'm very curious to learn what you come up with. Thanks! |
Constantly creating and destroying sockets is a well-known anti-pattern, so try to avoid that. |
So, granted this is a "stress test" but it is guaranteed to deadlock in my tests. The conclusion I drew from these tests is that pair sockets are not appropriate for inter-thread signalling, nor are pub/sub. Push/pull and client/server seem to work OK. If that's correct, I'd love to get confirmation from someone who knows the code a whole lot better than I do. On the other hand, maybe this behavior is a bug? I really don't know, which is why I'm asking the question. |
I can run this forever, no deadlocks or anything.. |
Sure, but it's a different problem. You have a 1:1 relationship between receiving and sending threads, so you only need to connect once. My use case is a 1:n relationship between receiver and senders. Once you have that, you need to connect and disconnect every time (with PAIR sockets), and that is guaranteed to deadlock. I don't know whether that's a bug or a feature, but as far as I can tell it's not documented. |
If you have a 1:n relationship, then PAIR is not the right topology - as the name implies, it's 1:1. As I mentioned, connecting and disconnecting continuously is an anti-pattern. |
I keep hearing about "anti-patterns", but I haven't seen them in any documentation. How would I go about finding them? As for PAIR not being the right approach, I understand that (now), but the documentation doesn't make that clear. The only thing the docs say is that PAIR's can only be connected to one other socket at a time. It would have saved me a lot of wasted time if that limitation was pointed out. Also, if PAIR sockets are not intended for other than a single connect, it would be helpful if they returned an error when you ask them to do something "wrong". Last but not least -- so you're saying the deadlock is not a bug? |
The API documentation says PAIR sockets block forever when their peer doesn't answer, and that they can only have one peer for their lifetime. |
That's not what the docs say (emphasis mine):
If the intention is that a PAIR socket can not disconnect and then reconnect (i.e., it's essentially a "one-shot"), it would be helpful if the subsequent connects returned an error. They do not. The current situation appears to be:
|
I think the discussion is a bit vague about what is legal/illegal and what is optimal/suboptimal. @bluca Correct me if I am getting this wrong, but in my understanding, by "anti-pattern" you describe something that is legal, but maybe suboptimal. As I understand it, what @WallStProg describes is a legal procedure and should not deadlock. I cannot judge if it is optimal or not, but maybe this issue should focus on the bug (deadlock), and not on how the general approach might be improved. |
First off, from my personal perspective the problem is solved. I've satisfied myself that the only socket types appropriate for inter-thread signalling are push/pull and client/server (and possibly router/dealer, although I did not test those). The other socket types are clearly not appropriate, for one reason or another. That still leaves some open issues, I think:
Many of these issues are documentation issues. I'm happy to help with that -- I'm a rare coder who also likes to write words. But I'd need to know what to write, which is the whole point of this thread. I don't feel like I've been getting straight answers, and that's frustrating. Having said the above, I'm pretty much blown away by what ZeroMQ can do, and I've worked with several other messaging middlewares (TibRV, 29 West, Wombat, etc.). ZeroMQ is great stuff, but using it for real-world applications is not as obvious as it could be. Again, I'm happy to help. |
As we know, zmq sockets are not thread-safe, so a wrapper has to be made. I am keeping an eye on zmq_poll's thundering herd. |
The sample code that I posted (https://github.com/WallStProg/zmqtests/tree/master/threads) does something similar:
The main thread simply executes zmq_poll on the two sockets until it gets a shutdown command, and then exits. All commands received on the inproc socket are executed on the main thread, and are therefore thread-safe. The sample code uses the approaches suggested in the zmq docs to send "ping" commands to the main thread. The results are documented in the repo, but the main takeaway is that the approaches recommended in the zmq docs simply don't work, or only work in very artificial scenarios. The short version (from the repo) is:
The problem is that the docs recommend using PAIR and PUB/SUB sockets for inter-thread communication, but my code indicates that both of these approaches have fatal flaws that only show up under stress. So, either my code is wrong, or the docs are wrong (possibly both ;-) What I'm trying to do here is to figure out which. If it's the docs, I've already offered to help update them. If it's my code, I'm happy to change it, if someone can tell me how to make it work.
Can you explain what you mean by this? |
Reviewing this code, I found that the bind/unbind/connect/disconnect actions and the send/receive methods run in different threads. You should put these actions into the same poller thread. That means the APIs the user calls are wrappers that post commands to the poller thread. The framework may be like this:
This is my issue regarding the performance. |
Nope. At no time does any socket get touched by any thread other than the thread that created it. (Except for the CLIENT/SERVER example, where the sockets are thread-safe by design).
If I understand you correctly, you're suggesting that ALL IO be pushed through a generic event queue (e.g., something like libevent). Whether that's a good idea or not (it's not), it isn't relevant to this discussion, which is simply about how to do inter-thread signalling with zmq. |
@WallStProg Can you point me at the bit that stops multiple threads from connecting to the same PAIR inproc endpoint at the same time? As to me it looks like you connect/disconnect with each thread having its own socket, but no limitation on multiple senders attempting to connect and send. Specifically I'm looking at the code around https://github.com/WallStProg/zmqtests/blob/master/threads/pairs.cpp#L26 |
Correct -- there is no bit that stops multiple threads from trying to connect at the same time. The intent here is to model a "real-world" application, where you don't necessarily know the number of threads ahead of time, nor do you have any control over when they might need to signal the main thread. So, the idea is to let zmq sort it out. Unfortunately, that doesn't appear to work, and eventually the sending thread will deadlock, which was the original point of this thread. FWIW, the API doc has this to say about PAIR sockets (emphasis mine):
|
If you try to connect a second socket to that same connection it will be rejected, which could explain a blocked send. However, as the connection is rejected, I'd have expected you to hit your assert checking the connection result. (Edit: connection can be async, so actually the blocked send makes sense there.) I have to admit a many-to-one setup is not something I would have considered a PAIR to be used for as, at the very least, you have to synchronise around the connection call. |
The way I read the docs is that there can only be a single PAIR connection at any given time, so pending connects would block until the bind socket becomes available. Coordinating this is not exactly rocket surgery, but in practice zmq deadlocks under load, which I contend is a bug. What I think you're suggesting is that zmq return an error on the connect if the peer is unavailable (i.e., already connected to another PAIR). That's not what it does, and that's not what the docs say, but I think that would be preferable to deadlocking. I don't know if I would have considered using PAIRs either, but that's what the Guide suggests (in fact, it's part of the chapter title): http://zguide.zeromq.org/page:all#Signaling-Between-Threads-PAIR-Sockets And this:
That statement is either incomplete, misleading or just plain wrong -- perhaps all three. My concern has to do with the docs recommending practices that work 99% of the time but that can fail, sometimes spectacularly, under conditions that, at least in my experience, are pretty common. As a zmq noob I have personally been bitten in the backside by several of these, and so I've brought them up as issues. |
The problem with blocking, or even my suggestion of an error code, rather than the way connect currently returns, is that you end up waiting there on a potential timeout, with all the associated problems. In this particular case I'd argue that it's only a "deadlock" because the code is attempting to send on an unconnected socket, which would also be the case if you just didn't call connect beforehand. I put "deadlock" in quotes here as the receiver thread isn't blocked on this send thread and will still respond to any message from a socket that is actually connected to it. The solution in a real-life system would be either locking around the connect/disconnect block, or using a unique inproc endpoint for each thread pair and just staying connected. I will point out that the quoted text does say between pairs of threads, which is true; the case here is between many threads. Probably we should consider rephrasing the wording in the documentation to clarify this in some way. |
Agreed, but anything is better than deadlock.
That appears not to be the case -- in fact, all threads end up blocked, and none can make progress:
It's not practical to have unique pairs of PAIR's unless you know ahead of time how many threads you're going to have, and that's often not the case. Locking around the connect/disconnect could work, but it seems that is the kind of thing that zmq should be doing itself. (And the docs imply that is the case).
Yay! That's really my main point here. (Well, that plus that deadlock is always a bug ;-) Again, in my experience the only socket types that are reliable in this use-case are PUSH/PULL and CLIENT/SERVER. Using PAIR's or PUB/SUB might work, or might not, depending on the use-case. It would have saved me a lot of trouble if the docs were clear on this, and so I keep banging on this drum so others can avoid having to go down the same rabbit-hole. |
Having them all deadlock seems odd; I would have expected that by the last one it wouldn't cause an issue, even if it was slow from the multiple connects/disconnects. It's kind of an odd model here, because it uses bidirectional sockets in a way that makes it impossible to use them bidirectionally. Much better off with PUSH/PULL for this. Personally I have never used PAIR sockets outside of testing. |
I hear you -- again, I'm just trying to follow the Golden Rule of software development:
|
Note that for PAIR I read the docs as: you must only connect one socket at a time. You are trying to connect multiple, and then strange things happen. Somewhere in the middle it was said to also deadlock with just one sender? Is that still the case? Or does it now only deadlock when doing something PAIR was not designed for? If the latter, then maybe changing the docs to say MUST instead of CAN would be a start, or even sufficient as a fix for this issue. |
Nope. If you want to comment on this issue, then please read and run the code: https://github.com/WallStProg/zmqtests/tree/master/threads |
I think @benjamg and @mrvn are right here, in that this is fundamentally a documentation issue. The offending line
is clearly misleading; a few sentences prior to this, the first line under the heading Exclusive pair pattern says (my emphasis):
The intention behind PAIR is to support the very common use case of two threads which need to communicate with each other; I think the naming of the pattern as "exclusive pair" is meant to emphasise this. I'm sure I'm not alone in making heavy use of CZMQ's zactor, which uses PAIR internally, and never having had any problems.
This is probably true, but I suspect it won't happen; while it is the intention that a PAIR socket cannot disconnect and then reconnect, I don't believe it's specifically forbidden, so it's possible that disallowing it may break existing code. It is certainly very strongly advisable not to attempt to use elements of the library in ways they weren't intended to be used; doing so may uncover issues which have not previously been encountered. |
What you are basically saying, at least from my perspective, is this:
That doesn't sound so good, certainly not from the point of view of someone trying to build reliable systems. So, if this was just a matter of misleading docs, that would be one thing, but to go back to something I said (much) earlier:
To build reliable systems with ZeroMQ developers need to know what works and what doesn't, and the library needs to do its job telling us when we're doing something wrong. Dismissing these kinds of problems as "well-known anti-patterns" (well-known? by whom?) is what we used to call a "cop-out". It is not helpful, and it does not help to inspire confidence in ZeroMQ. |
Not exactly - I'm saying that using the library as intended (i.e. only connecting a pair socket to precisely one other peer) should always work; using it in ways which were not intended, YMMV.
It is easy to write a program which deadlocks using, for example, your favourite pthreads library. That doesn't mean that your favourite pthreads library is buggy.
Exactly.
Ideally yes, but there is always room for improvement. The cost of using FOSS is that there are likely to be a few rough edges which the community hasn't felt the need to smooth out, for whatever reason. Everyone is encouraged to contribute improvements. |
At the risk of beating a dead horse ...
|
@WallStProg First of all, thank you very much for your efforts in bringing this forward. I also think it is important to do something about this in order not to compromise confidence in using libzmq.
I agree.
With respect to changing the documentation, yes of course, but this is essentially an incompatible change. Existing client code that is legal with the current specification might become illegal. One might argue that it couldn't have worked reliably with the existing implementation, but that's the case with any (implementation) bug. What I don't know is whether there are some fundamental issues with supporting this usage pattern. In principle, it might be impossible to implement without other unacceptable side-effects. In that case, there might be no other option than to change the specification in an incompatible way, to get back to an implementable specification.
I agree.
I do not agree with this in the generality you state it. I agree that it would be desirable to signal such misuses explicitly, but, again, I don't know if it is possible to detect this specific case with reasonable implementation complexity and runtime overhead. In general, there are misuses that can be detected easily, e.g. passing a parameter to an API function that is statically invalid. But on the other hand, there are misuses in libzmq, the STL, etc. that deliberately cause undefined behaviour. "Undefined behavior" of course includes that it might work as the user expects, portably or non-portably, deterministically or non-deterministically. Almost all misuses of non-thread-safe function calls fall into this category. They cannot be detected at runtime with reasonable effort. Tools like valgrind do this with significant runtime costs, but even those make assumptions on using known synchronization primitives. valgrind knows about pthread mutexes, e.g., but if you implemented your own mutex, it wouldn't know about that and report this as a misuse.
I agree, in the sense that like in other cases of undefined behaviour, this should also be stated explicitly in the docs. |
@sigiesec -- thanks for listening! Really all I've been looking for here is some answers, and I'm delighted that it sounds like you're going to take a look at this. Please let me know if there's anything I can do to help. May I suggest that this particular trail of breadcrumbs is a bit "un-tidy" at this point, and it might be best to open a new issue to explore the disconnect/(re-)connect question, but that's your call. |
@WallStProg I finally had a look at your test program in https://github.com/WallStProg/zmqtests/tree/master/threads. I must admit that I haven't read all details of the discussion here before, so I need to add a few points:
Unfortunately, since your program uses pthreads, I cannot simply test it on my Windows machine. So two things you could do:
|
That's interesting - I didn't know that. I should probably get more familiar with the spec, so I checked https://rfc.zeromq.org/spec:31/EXPAIR, but frankly I don't see where it says that. If you can help me understand better how to read the spec, I'd appreciate it.
I was also thinking of putting a mutex on the SUB side of the PAIR and seeing if there's any difference. BTW, I seem to remember reading somewhere that connects on inproc sockets are synchronous, but I can't find the reference -- is that true?
If that's the only hurdle, maybe it would help if I switched to Boost threads? I think that should work on Windows, but I don't have a Windows machine to test with.
Yes, in the sense that it will eventually hang in
Here's the stack:
As above, I'll try this with both a mutex and the socket monitor -- the latter will take a bit more work. I'll post back here when I have something. |
The 31/EXPAIR spec says:
I am not completely sure what you are trying to do overall, but I think you should somehow use DEALER and ROUTER sockets to achieve your goals. The basic mechanism is described here http://zguide.zeromq.org/page:all#The-Asynchronous-Client-Server-Pattern and there are several more advanced patterns based on that.
That would synchronize your threads and might solve the issue above, but this is not what I meant. I meant synchronizing your thread with the libzmq internal asynchronous handling.
connect -> yes
Hm... std::thread would be better for me testing it locally. However, if this were to be included in the 0MQ test suite, neither could be used, but only zmq_thread*, but these do not provide TLS.
That's not what I meant with "reliably" ;) So basically this means "no".
Ok, thanks. But please try with a single thread. |
I ended up using CLIENT/SERVER sockets for my application: they're thread-safe, which is handy, and a bit simpler since they don't require multi-part messages.
You're right -- wrapping the connect/send/disconnect in a mutex doesn't help at all. (WallStProg/zmqtests@f9a448d#diff-0919fe44fdbbd233e5e2e8587006b7b2)
I think this is already answered -- a single thread doing connect/send/disconnect will eventually deadlock on the zmq_send (see earlier stack trace). I'll try tapping into zmq_socket_monitor next, but that will be a bit more work -- will report back. Thanks again! |
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions. |
Issue description
Using the pattern described in "Signaling between Threads", zmq_send deadlocks when the receiving (bind) thread is in zmq_poll.
Environment
What's the actual result? (include assertion message & call stack if applicable)
Call stack attached. There appears to be a deadlock between thread 1 & thread 5.
What's the expected result?
zmq_send should succeed and trigger zmq_poll to return with read event on the PAIR socket.
poll-hang.txt