
deadlock between zmq_poll, zmq_send using inproc PAIR sockets #2759

Closed
WallStProg opened this issue Sep 28, 2017 · 49 comments

Comments

@WallStProg
Contributor

Issue description

Using the pattern described in "Signaling between Threads” zmq_send deadlocks when the receiving (bind) thread is in zmq_poll.

Environment

  • libzmq version (commit hash if unreleased): d175819
  • OS: CentOS 6

What's the actual result? (include assertion message & call stack if applicable)

Call stack attached. There appears to be a deadlock between thread 1 & thread 5.

What's the expected result?

zmq_send should succeed and trigger zmq_poll to return with read event on the PAIR socket.

poll-hang.txt

@Asmod4n
Contributor

Asmod4n commented Sep 29, 2017

Do you happen to share the sockets between threads?
Could you give a code example to reproduce it?

@WallStProg
Contributor Author

WallStProg commented Sep 29, 2017 via email

@Kentzo
Contributor

Kentzo commented Oct 4, 2017

Why do you think there is a deadlock? Poll's timeout is -1, it could just wait for something to arrive.

@WallStProg
Contributor Author

WallStProg commented Oct 4, 2017 via email

@WallStProg
Contributor Author

WallStProg commented Oct 4, 2017 via email

WallStProg pushed a commit to WallStProg/zmqtests that referenced this issue Oct 6, 2017
@WallStProg
Contributor Author

I've created a sample program that demonstrates the problem, using "pure" 0MQ code. The repo is at https://github.com/WallStProg/zmqtests.git.

Please see the README.md etc. in threads directory.

As mentioned above, PAIR sockets deadlock, PUB sockets don't, but both silently drop (lots of) messages.

I imagine that I'm doing something wrong here, but for the life of me I can't see what it is. Any help would be appreciated!

@bluca
Member

bluca commented Oct 6, 2017

As the documentation says, sockets must not be used from multiple threads.

@WallStProg
Contributor Author

Which socket are you referring to? The controlReceiver is being used from a single thread, the controlSender is created and destroyed from the sending threads each time.

As far as I can tell, there is no sharing of sockets between threads. That is in fact the reason for using the controlSub -- to be able to execute commands on the dataSub socket from a single thread.

If you can be more specific about what you think is wrong, that would be very helpful.

@bluca
Member

bluca commented Oct 6, 2017

Unless I'm mistaken, sockets are created and bound in the main thread:

https://github.com/WallStProg/zmqtests/blob/master/threads/threads.cpp#L197
https://github.com/WallStProg/zmqtests/blob/master/threads/threads.cpp#L201

And then used from another thread:

https://github.com/WallStProg/zmqtests/blob/master/threads/threads.cpp#L141

Also creating and immediately deleting a socket is an anti-pattern:

https://github.com/WallStProg/zmqtests/blob/master/threads/threads.cpp#L62

It's better to make them long-lived and reuse them.

@Asmod4n
Contributor

Asmod4n commented Oct 6, 2017

Especially with ZMQ_PAIR Sockets its better to only bind and connect once.

@WallStProg
Contributor Author

WallStProg commented Oct 7, 2017

Thanks for the replies! Unfortunately, still having difficulties, so please bear with me.

Unless I'm mistaken, sockets are created and bound in the main thread...And then used from another thread:

I moved the socket create and bind to mainLoop, so there is no possibility of those sockets being shared between threads (see repo). That didn't make any difference.

Also creating and immediately deleting a socket is an anti-pattern:

Well, that's what the example code shows (Code Connected, Signaling Between Threads) -- granted this is a somewhat different use-case.

If that is not going to work, the only other thing I can think of is to create a sending socket in each thread (in thread-local storage), and connect that to the main receiver socket that is being polled.

Especially with ZMQ_PAIR Sockets its better to only bind and connect once.

If each thread needs its own sending socket, then PAIR sockets don't work, since there can only be a single concurrent connection between sender and receiver (unless I'm reading the docs wrong).

So, it sounds like I would need to:

  • create and bind a SUB socket in the main thread and include that in zmq_poll, in addition to any other sockets I need to read from.
  • create a PUB socket in each thread in thread-local storage, and connect the socket to SUB socket's endpoint.

A couple of issues that come to mind:

  • This means that PUB sockets would be connecting to SUB sockets, over inproc transport. There's been a lot of discussion about problems with PUBs connecting to SUBs, so is there anything special that needs to be done to make that work?
  • I'm not sure how I would clean up the individual thread's sending sockets, although I seem to recall that with later versions of ZeroMQ there is a way to force the context to terminate, even if there are open sockets? Can anyone confirm/refute that?

Just a bit of background here that may make things clearer, and/or suggest other possible solutions/work-arounds:

  • The control socket is there for two purposes only:
    • the first is to allow different threads to subscribe/unsubscribe dynamically by sending commands to the main thread that is polling the "data" socket;
    • the second is to have a way to terminate the zmq_poll cleanly at shutdown, again by sending a command that breaks out of the loop.
  • The application can create an arbitrary number of threads, so any solution needs to be scalable.

Again, many thanks for any hints, tips, suggestions, etc.!

@WallStProg
Contributor Author

I've put together an exhaustive survey of the inter-thread communications methods here: https://github.com/WallStProg/zmqtests/tree/master/threads.

This includes a sample of using PAIR sockets for inter-thread communication that will deadlock every single time.

As far as I can tell, the documentation in http://zguide.zeromq.org/page:all#Signaling-Between-Threads-PAIR-Sockets is misleading -- while the PAIR example will work in the simple sample provided, in more "real-world" scenarios it is prone to deadlock.

Comments, criticisms, suggestions are welcome!

@bluca
Member

bluca commented Oct 22, 2017

As far as I can tell, the documentation in http://zguide.zeromq.org/page:all#Signaling-Between-Threads-PAIR-Sockets is misleading -- while the PAIR example will work in the simple sample provided, in more "real-world" scenarios it is prone to deadlock.

That feels a bit hard to believe - zactors are behind half the APIs of CZMQ, and they use PAIR sockets just fine, including in real-world scenarios that run for weeks.

@WallStProg
Contributor Author

That's what I thought too, and I've been tearing my hair out trying to figure it out.

Which is why I put the code up -- it's quite possible I'm doing something wrong, and if that's the case I'd love to know what it is.

I believe the code is pretty much self-explanatory, but let me know if you have any questions.

I suspect the problem may be that I'm creating short-lived connections from multiple threads -- the typical use-case I see is having a single long-lived connection between the two ends of the PAIR.

I'm very curious to learn what you come up with.

Thanks!

@bluca
Member

bluca commented Oct 22, 2017

Constantly creating and destroying sockets is a well-known anti-pattern, so try to avoid that.

@WallStProg
Contributor Author

  • The code doesn't do that -- it creates a single receiver and a single sender per thread.
  • The pairs example deadlocks with a single sender thread (see the README).
  • The pairs example does connect and disconnect in a tight loop, which is necessary with PAIR sockets to support multiple senders, since only one connection can be active at any given time.

So, granted this is a "stress test" but it is guaranteed to deadlock in my tests.

The conclusion I drew from these tests is that pair sockets are not appropriate for inter-thread signalling, nor are pub/sub. Push/pull and client/server seem to work OK.

If that's correct, I'd love to get confirmation from someone who knows the code a whole lot better than I do.

On the other hand, maybe this behavior is a bug? I really don't know, which is why I'm asking the question.

@Asmod4n
Contributor

Asmod4n commented Oct 23, 2017

I can run this forever, no deadlocks or anything..
https://gist.github.com/Asmod4n/7af7f15e11f81aff0f5f4e4c62d79c92

@WallStProg
Contributor Author

I can run this forever, no deadlocks or anything..

Sure, but it's a different problem. You have a 1:1 relationship between receiving and sending threads, so you only need to connect once.

My use case is a 1:n relationship between receiver and senders. Once you have that, you need to connect and disconnect every time (with PAIR sockets), and that is guaranteed to deadlock.

I don't know whether that's a bug or a feature, but as far as I can tell it's not documented.

@bluca
Member

bluca commented Oct 23, 2017

If you have a 1:n relationship, then PAIR is not the right topology - as the name implies, it's 1:1. As I mentioned, connecting and disconnecting continuously is an anti-pattern.
Use REP-REQ or ROUTER-DEALER for 1:n.

@WallStProg
Contributor Author

I keep hearing about "anti-patterns", but I haven't seen them in any documentation. How would I go about finding them?

As for PAIR not being the right approach, I understand that (now), but the documentation doesn't make that clear. The only thing the docs say is that PAIR's can only be connected to one other socket at a time.

It would have saved me a lot of wasted time if that limitation was pointed out.

Also, if PAIR sockets are not intended for other than a single connect, it would be helpful if they returned an error when you ask them to do something "wrong".

Last but not least -- so you're saying the deadlock is not a bug?

@Asmod4n
Contributor

Asmod4n commented Oct 23, 2017

The api documentation says pair sockets block forever when their peer doesn't answer and that they can only have one peer for their lifetime.
http://api.zeromq.org/master:zmq-socket#toc17
(Maybe the documentation could be updated to show if sockets are 1:1 or 1:n or n:m)

@WallStProg
Contributor Author

WallStProg commented Oct 23, 2017

That's not what the docs say (emphasis mine):

A socket of type ZMQ_PAIR can only be connected to a single peer at any one time.

If the intention is that a PAIR socket can not disconnect and then reconnect (i.e., it's essentially a "one-shot"), it would be helpful if the subsequent connects returned an error. They do not.

The current situation appears to be:

  • A "listener" can bind a PAIR socket to an endpoint, then go into a zmq_poll loop on that socket (and possibly other sockets).
  • A "sender" can connect to that endpoint, send one (or more?) messages, and then disconnect.
  • The same sender can re-connect to the endpoint -- the connect will (appear to) succeed.
  • A subsequent send will immediately deadlock.
    • For whatever reason, issuing a zmq_poll on the socket after the subsequent connect will delay the inevitable deadlock, but eventually it will deadlock on the send.

@sigiesec
Member

I think the discussion is a bit vague about what is legal/illegal and what is optimal/suboptimal.

@bluca Correct me if I am getting this wrong, but in my understanding by "anti-pattern" you describe something that is legal, but maybe suboptimal.

As I understand it, what @WallStProg describes is a legal procedure and should not deadlock. I cannot judge if it is optimal or not, but maybe this issue should focus on the bug (deadlock), and not on how the general approach might be improved.

@WallStProg
Contributor Author

First off, from my personal perspective the problem is solved. I've satisfied myself that the only socket types appropriate for inter-thread signalling are push/pull and client/server (and possibly router/dealer, although I did not test those). The other socket types are clearly not appropriate, for one reason or another.

That still leaves some open issues, I think:

  1. The docs are misleading at best in several areas, and this is one of them.
    1. The chapter on inter-thread signaling should be updated to reflect the issues with other socket types, and to recommend push/pull or client/server.
    2. The documentation on PAIR sockets is unclear and/or misleading. Either you can disconnect and re-connect or not, and the docs should say one way or the other.
    3. If re-connecting is an "anti-pattern" the code should enforce that and return an error.
    4. A similar situation exists with regard to PUBs connecting to SUBs (First subscriptions seem to be dropped when a PUB connects to a SUB #2267). The "work-around" suggested doesn't work, except in a contrived example (https://gist.github.com/hintjens/7344533) where a single monolithic program knows when to poll on the SUB socket. At a minimum the problem of PUBs connecting to SUBs needs to be documented much more clearly, as opposed to being "swept under the rug", as it is in most of the docs.
    5. Many of the example programs are "toys" and don't reflect what one would realistically need to do in a "real-world" application. Extrapolating from those toy examples to real code is dangerous.
    6. It would also be helpful to have real-world examples of how to make ZeroMQ sockets thread-safe using TLS. I published one approach that seems to work, but there are probably others.
  2. Behavior needs to be predictable, deterministic.
    1. It's not OK for code to behave one way under load and another way when not. Especially combined with the problem of "toy" examples, it can mislead someone into thinking that a particular "pattern" will work, when it instead fails in strange and interesting ways.
  3. There is no scenario where deadlock is not a bug.
  4. I'm still looking for a collection of these "well-known anti-patterns" that I seem to keep bumping into. I would think that would be helpful to others new to ZeroMQ.

Many of these issues are documentation issues. I'm happy to help with that -- I'm a rare coder who also likes to write words. But I'd need to know what to write, which is the whole point of this thread. I don't feel like I've been getting straight answers, and that's frustrating.

Having said the above, I'm pretty much blown away by what ZeroMQ can do, and I've worked with several other messaging middlewares (TibRV, 29 West, Wombat, etc.).

ZeroMQ is great stuff, but using it for real-world applications is not as obvious as it could be. Again, I'm happy to help.

@loachfish

As we know, zmq sockets are not thread-safe, so a wrapper has to be built around them.
In our practice, a data handle includes at least three zmq sockets: a PUSH socket for the user-facing send side, a PULL socket that receives the pushed data, and a third socket of whatever type the data handle calls for (ROUTER, DEALER, PUB, or another type). We put this data-handle object into a poller thread, so the send and receive calls at the zmq_socket layer all happen on the same thread.

I keep an eye on the zmq_poll's thundering herd.

@WallStProg
Contributor Author

In our practice, a data handle includes three zmq_socket at least...

The sample code that I posted (https://github.com/WallStProg/zmqtests/tree/master/threads) does something similar:

  1. The main thread creates an inproc socket, which is used by other threads to send commands (subscribe, unsubscribe, shutdown) to the main thread
  2. It also creates a SUB socket, which is the main data channel for the app (not used in the sample, but it would be in a "real-world" app).

The main thread simply executes zmq_poll on the two sockets until it gets a shutdown command, and then exits. All commands received on the inproc socket are executed on the main thread, and is therefore thread-safe.

The sample code uses the approaches suggested in the zmq docs to send "ping" commands to the main thread. The results are documented in the repo, but the main takeaway is that the approaches recommended in the zmq docs simply don't work, or only work in very artificial scenarios (e.g., as in
https://gist.github.com/hintjens/7344533, where the sub socket is somehow smart enough to "know" that it needs to poll after the pub socket connects to it).

The short version (from repo) is:

  • PAIR sockets don't work -- they will eventually deadlock.
  • PUB/SUB sockets don't work -- under certain conditions, the initial "n" messages sent will be lost.
  • PUSH/PULL sockets do work:
  • CLIENT/SERVER sockets also work, and have the added benefit that they are thread-safe by design.

The problem is that the docs recommend using PAIR and PUB/SUB sockets for inter-thread communication, but my code indicates that both of these approaches have fatal flaws that only show up under stress.

So, either my code is wrong, or the docs are wrong (possibly both ;-)

What I'm trying to do here is to figure out which. If it's the docs, I've already offered to help update them. If it's my code, I'm happy to change it, if someone can tell me how to make it work.

I keep an eye on the zmq_poll's thundering herd.

Can you explain what you mean by this?

@loachfish

loachfish commented Nov 17, 2017

(https://github.com/WallStProg/zmqtests/tree/master/threads)

Reviewing this code, I found that the bind/unbind/connect/disconnect actions and the send/receive calls happen on different threads. You should move all of these actions into the same poller thread; the API the user calls would then be a wrapper that posts commands to that thread. The framework might look like this:

  1. The user posts bind/unbind/connect/disconnect requests via a lock-free message queue whose consumer is the poller thread.
  2. The user calls send through an internal PAIR-PAIR connection, and the real send operation happens on the poller thread.
  3. The poller thread monitors the zmq_socket's monitor events and POLL_IN, and calls back to the user. Preferably, the event is posted to a dispatch thread, which then calls back to the user.
  4. Care is needed with any other action that can touch a socket from multiple threads; how do we know which actions can cause problems? In my understanding, they are the API calls that end up in zmq::socket_base_t::process_commands.

I keep an eye on the zmq_poll's thundering herd.

This is a separate issue of mine, about performance.

@WallStProg
Contributor Author

Reviewing this code, I found that the bind/unbind/connect/disconnect actions and the send/receive calls happen on different threads

Nope. At no time does any socket get touched by any thread other than the thread that created it. (Except for the CLIENT/SERVER example, where the sockets are thread-safe by design).

The framework may be like this:

If I understand you correctly, you're suggesting that ALL IO be pushed through a generic event queue (e.g., something like libevent). Whether that's a good idea or not (it's not), it isn't relevant to this discussion, which is simply about how to do inter-thread signalling with zmq.

@benjamg
Member

benjamg commented Nov 17, 2017

@WallStProg Can you point me at the bit that stops multiple threads from connecting to the same PAIR inproc endpoint at the same time? As to me it looks like you connect/disconnect with each thread having its own socket, but with no limitation on multiple senders attempting to connect and send.

Specifically I'm looking at the code around https://github.com/WallStProg/zmqtests/blob/master/threads/pairs.cpp#L26

@WallStProg
Contributor Author

Correct -- there is no bit that stops multiple threads from trying to connect at the same time.

The intent here is to model a "real-world" application, where you don't necessarily know the number of threads ahead of time, nor do you have any control over when they might need to signal the main thread.

So, the idea is to let zmq sort it out. Unfortunately, that doesn't appear to work, and eventually the sending thread will deadlock, which was the original point of this thread.

FWIW, the API doc has this to say about PAIR sockets (emphasis mine):

When a ZMQ_PAIR socket enters the mute state due to having reached the high water mark for the connected peer, or if no peer is connected, then any zmq_send(3) operations on the socket shall block until the peer becomes available for sending; messages are not discarded.

@benjamg
Member

benjamg commented Nov 17, 2017

If you try to connect a second socket to that same endpoint, it will be rejected, which could explain a blocked send. However, as the connection is rejected, I'd have expected you to hit your assert checking the connection result. (Edit: connection can be async, so actually the blocked send makes sense there.)

I have to admit a many to one setup is not something I would have considered a pair to be used for, as, at the very least, you have to synchronise around the connection call.

@WallStProg
Contributor Author

The way I read the docs is that there can only be a single PAIR connection at any given time, so pending connects would block until the bind socket becomes available. Coordinating this is not exactly rocket surgery, but in practice zmq deadlocks under load, which I contend is a bug.

What I think you're suggesting is that zmq return an error on the connect if the peer is unavailable (i.e., already connected to another PAIR). That's not what it does, and that's not what the docs say, but I think that would be preferable to deadlocking.

I don't know if I would have considered using PAIRs either, but that's what the Guide suggests (in fact, it's part of the chapter title): http://zguide.zeromq.org/page:all#Signaling-Between-Threads-PAIR-Sockets

And this:

For these reasons, PAIR makes the best choice for coordination between pairs of threads.

That statement is either incomplete, misleading or just plain wrong -- perhaps all three.

My concern has to do with the docs recommending practices that work 99% of the time but that can fail, sometimes spectacularly, under conditions that, at least in my experience, are pretty common. As a zmq noob I have personally been bitten in the backside by several of these, and so I've brought them up as issues.

@benjamg
Member

benjamg commented Nov 17, 2017

The problem with blocking, or even my suggestion of an error code, rather than the way it returns on connect is that you end up there awaiting a potential timeout, and all the associated problems.

In this particular case I'd argue that it's only a "deadlock" because the code is attempted to send on an unconnected socket, which would also be the case if you just didn't call connect before.

I put "deadlock" in quotes here as the receiver thread isn't blocked on this send thread and will still respond to any message from a socket that is actually connected to it. The solution in a real-life system would be either locking around the connect/disconnect block or using a unique inproc endpoint for each thread pair and just staying connected. I will point out that the quoted text does say between pairs of threads, which is true; the case here is between many threads.

Probably we should consider rephrasing the wording in the documentation to clarify this in some way.

@WallStProg
Contributor Author

The problem with blocking, or even my suggestion of an error code, rather than the way it returns on connect is that you end up there awaiting a potential timeout, and all the associated problems.

Agreed, but anything is better than deadlock.

I put "deadlock" in quotes here as the receiver thread isn't blocked on this send thread and will still respond to any message from a socket that is actually connected to it.

That appears not to be the case -- in fact, all threads end up blocked, and none can make progress:

/home/btorpey/work/zmqtests/threads $ ./pairs -msgs 10000 -threads 10 -quiet -poll
Using libzmq version 4.2.3
Using pair sockets
Sleeping for 1 seconds at shutdown
Polling after connect
^C
Thread 2113920 sent # 3664, received # 3663
Thread 2113928 sent # 1, received # 0
Thread 2113930 sent # 1, received # 0
Thread 2113938 sent # 1, received # 0
Thread 2113940 sent # 1, received # 0
Thread 2113948 sent # 1, received # 0
Thread 2113950 sent # 1, received # 0
Thread 2113958 sent # 1, received # 0
Thread 2113960 sent # 1, received # 0
Thread 2113968 sent # 1, received # 0
/home/btorpey/work/zmqtests/threads $ 

The solution in a real-life system would be either locking around the connect/disconnect block or using a unique inproc endpoint for each thread pair and just staying connected. I will point out that the quoted text does say between pairs of threads, which is true; the case here is between many threads.

It's not practical to have unique pairs of PAIR's unless you know ahead of time how many threads you're going to have, and that's often not the case.

Locking around the connect/disconnect could work, but it seems that is the kind of thing that zmq should be doing itself. (And the docs imply that is the case).

Probably we should consider rephrasing the wording in the documentation to clarify this in some way.

Yay! That's really my main point here. (Well, that plus that deadlock is always a bug ;-)

Again, in my experience the only socket types that are reliable in this use-case are PUSH/PULL and CLIENT/SERVER. Using PAIR's or PUB/SUB might work, or might not, depending on the use-case. It would have saved me a lot of trouble if the docs were clear on this, and so I keep banging on this drum so others can avoid having to go down the same rabbit-hole.

@benjamg
Member

benjamg commented Nov 17, 2017

Having them all deadlock seems odd; I would have expected that by the last one it wouldn't cause an issue, even if slowed down by the repeated connect/disconnect.

It's kind of an odd model here, because it uses bidirectional sockets in a way that makes it impossible to use them bidirectionally. You'd be much better off with push/pull for this.

Personally I have never used pair sockets outside of testing.

@WallStProg
Contributor Author

I hear you -- again, I'm just trying to follow the Golden Rule of software development:

When all else fails, read the instructions.

@mrvn
Contributor

mrvn commented Apr 10, 2018

Note that for PAIR I read the docs as you must only connect one socket at a time. You are trying to connect multiple and then strange things happen.

Somewhere in the middle it was said to also deadlock with just one sender? Is that still the case? Or does it now only deadlock when doing something PAIR was not designed for?

If the latter, then maybe changing the docs to say MUST instead of CAN would be a start, or even sufficient as a fix for this issue.

@WallStProg
Contributor Author

WallStProg commented Apr 11, 2018

Note that for PAIR I read the docs as you must only connect one socket at a time. You are trying to connect multiple and then strange things happen.

Nope.

If you want to comment on this issue, then please read and run the code: https://github.com/WallStProg/zmqtests/tree/master/threads

@stephenprocter

I think @benjamg and @mrvn are right here, in that this is fundamentally a documentation issue. The offending line

A socket of type ZMQ_PAIR can only be connected to a single peer at any one time.

is clearly misleading; a few sentences prior to this, the first line under the heading Exclusive pair pattern says (my emphasis):

The exclusive pair pattern is used to connect a peer to precisely one other peer.

The intention behind PAIR is to support the very common use case of two threads which need to communicate with each other; I think the naming of the pattern as "exclusive pair" is meant to emphasise this. I'm sure I'm not alone in making heavy use of CZMQ's zactor, which uses PAIR internally, and never having had any problems.

@WallStProg:

If the intention is that a PAIR socket can not disconnect and then reconnect (i.e., it's essentially a "one-shot"), it would be helpful if the subsequent connects returned an error.

This is probably true, but I suspect it won't happen; while it is the intention that a PAIR socket cannot disconnect and then reconnect, I don't believe it's specifically forbidden, so it's possible that disallowing it may break existing code. It is certainly very strongly advisable not to attempt to use elements of the library in ways they weren't intended to be used; doing so may uncover issues which have not previously been encountered.

@WallStProg
Contributor Author

WallStProg commented May 10, 2018

What you are basically saying, at least from my perspective, is this:

Doing "X" will usually work, but sometimes it won't. There is no way to tell whether "X" will fail at some point. When it does fail, it may fail silently, or it may deadlock the process.

That doesn't sound so good, certainly not from the point of view of someone trying to build reliable systems.

So, if this was just a matter of misleading docs, that would be one thing, but to go back to something I said (much) earlier:

There is no scenario where deadlock is not a bug.

To build reliable systems with ZeroMQ developers need to know what works and what doesn't, and the library needs to do its job telling us when we're doing something wrong.

Dismissing these kinds of problems as "well-known anti-patterns" (well-known? by whom?) is what we used to call a "cop-out". It is not helpful, and it does not help to inspire confidence in ZeroMQ.

@stephenprocter

What you are basically saying, at least from my perspective, is this: Doing "X" will usually work, but sometimes it won't. There is no way to tell whether "X" will fail at some point. When it does fail, it may fail silently, or it may deadlock the process.

Not exactly - I'm saying that using the library as intended (i.e. only connecting a pair socket to precisely one other peer) should always work; using it in ways which were not intended, YMMV.

There is no scenario where deadlock is not a bug.

It is easy to write a program which deadlocks using, for example, your favourite pthreads library. That doesn't mean that your favourite pthreads library is buggy.

To build reliable systems with ZeroMQ developers need to know what works and what doesn't

Exactly.

and the library needs to do its job telling us when we're doing something wrong.

Ideally yes, but there is always room for improvement. The cost of using FOSS is that there are likely to be a few rough edges which the community hasn't felt the need to smooth out, for whatever reason. Everyone is encouraged to contribute improvements.

@WallStProg
Contributor Author

At the risk of beating a dead horse ...

  • To make things clearer, I've added a -rate parameter to the tests that controls how often the send thread runs. (Up till now, the send thread always ran flat-out).

    On my test machine, the pairs disconnect/(re-)connect test appears to work reliably (threads/pairs -threads 1 -msgs 1000 -quiet -rate 1000) (i.e., with a 1ms delay between connection attempts).

    However, when the test program is run in a loop, it hangs after some number of iterations (in one case, on iteration 1104).

    This behavior is very dangerous -- it can lead people to believe that this (anti-?) pattern is reliable, when it is not.

    As a designer of higher-level APIs, I need the libraries I use to be reliable and consistent. If they are not, I run the risk of building software that appears to work, but that mis-behaves (e.g., hangs, crashes) under load.

    IMO, this behavior is clearly not OK, and I certainly hope that the project maintainers agree.

  • If ZMQ_PAIR sockets don't support disconnect/(re-)connect, then the docs are unclear about that. That would be a doc bug, and presumably a relatively simple fix.

    I don't know whether disconnecting and re-connecting ZMQ_PAIR sockets is supposed to work -- that's what I'm asking. Lots of people have expressed opinions, but I would expect a definitive answer to come from one of the project's maintainers -- at this point that looks to be either Luca or Simon, but perhaps others also?

  • If disconnect/(re-)connect is supported, then there's a bug in the code, because it doesn't work (reliably).

    This I think is demonstrated by the sample code: (https://github.com/WallStProg/zmqtests/tree/master/threads)

  • Even if disconnect/(re-)connect is not supported, there's still a (different) bug in the code, because it should not cause deadlock.

    I could see some disagreement about this one. Maybe returning an error in this scenario is really hard to do, given the way the code is structured. OK, so the bug might not get fixed anytime soon, but it's still a bug and should at least be in the bug db, if only so it can be pointed out to poor slobs (like me) who can't understand why sometimes it works and sometimes it doesn't. (I would also suggest that if this is the case then a REALLY BIG DISCLAIMER should be inserted in the docs to warn people ("here be dragons")).

    Or, maybe it's really expensive to detect this situation, and we don't want to burden everyone with the cost of doing the check. Fine, but then we might implement a sockopt that could enable/disable the check (and potentially others like it?).

@sigiesec
Member

@WallStProg First of all, thank you very much for your efforts in bringing this forward. I also think it is important to do something about this in order not to compromise confidence in using libzmq.

If ZMQ_PAIR sockets don't support disconnect/(re-)connect, then the docs are unclear about that. That would be a doc bug, ...

I agree.

... and presumably a relatively simple fix.

With respect to changing the documentation, yes of course, but this is essentially an incompatible change. Existing client code that is legal with the current specification might become illegal. One might argue that it couldn't have worked reliably with the existing implementation, but that's the case with any (implementation) bug.

What I don't know is whether there are fundamental issues with supporting this usage pattern. In principle, it might be impossible to implement without other unacceptable side-effects. In that case, there might be no option other than to change the specification in an incompatible way, to get back to an implementable specification.

If disconnect/(re-)connect is supported, then there's a bug in the code, because it doesn't work (reliably).

I agree.

Even if disconnect/(re-)connect is not supported, there's still a (different) bug in the code, because it should not cause deadlock.

I do not agree with this in the generality you state it. I agree that it would be desirable to signal such misuses explicitly, but, again, I don't know if it is possible to detect this specific case with reasonable implementation complexity and runtime overhead.

In general, there are misuses that can be detected easily, e.g. passing a parameter to an API function that is statically invalid. But on the other hand, there are misuses of libzmq, the STL, etc. whose behaviour is deliberately left undefined. "Undefined behavior" of course includes that it might work as the user expects, portably or non-portably, deterministically or non-deterministically. Almost all misuses of non-thread-safe function calls fall into this category. They cannot be detected at runtime with reasonable effort. Tools like valgrind do this at significant runtime cost, but even those make assumptions about the use of known synchronization primitives. valgrind knows about pthread mutexes, for example, but if you implemented your own mutex, it wouldn't know about it and would report the usage as a misuse.

(I would also suggest that if this is the case then a REALLY BIG DISCLAIMER should be inserted in the docs to warn people ("here be dragons")).

I agree, in the sense that like in other cases of undefined behaviour, this should also be stated explicitly in the docs.

@WallStProg
Contributor Author

@sigiesec -- thanks for listening!

Really all I've been looking for here is some answers, and I'm delighted that it sounds like you're going to take a look at this. Please let me know if there's anything I can do to help.

May I suggest that this particular trail of breadcrumbs is a bit "un-tidy" at this point, and it might be best to open a new issue to explore the disconnect/(re-)connect question, but that's your call.

@sigiesec
Member

@WallStProg I finally had a look at your test program in https://github.com/WallStProg/zmqtests/tree/master/threads. I must admit that I haven't read all details of the discussion here before, so I need to add a few points:

  • Multiple concurrent connection attempts from different sockets to the same ZMQ_PAIR endpoint (i.e. your program with numThreads > 1) are not allowed by the current specification. The documentation should be improved to make it clear that this causes undefined behaviour rather than connection attempts queueing up.

  • A sequence of [connect/disconnect]+ calls from a single thread (numThreads == 1) is the question I had in mind when writing my previous comment. I now think that this might be problematic because disconnect works asynchronously, and a new connect would be allowed only after the disconnect completed. You could add a socket monitor to wait for the ZMQ_EVENT_DISCONNECTED event to synchronize this.

Unfortunately, since your program uses pthreads, I cannot simply test it on my Windows machine. So two things you could do:

  1. Does your program work reliably with numThreads == 1?
  2. If it does not, does it work reliably when you add a socket monitor and wait for a ZMQ_EVENT_DISCONNECTED after calling zmq_disconnect?

@WallStProg
Contributor Author

Multiple concurrent connection ... are not allowed by the current specification

That's interesting - I didn't know that. I should probably get more familiar with the spec, so I checked https://rfc.zeromq.org/spec:31/EXPAIR, but frankly I don't see where it says that. If you can help me understand better how to read the spec, I'd appreciate it.


...You could add a socket monitor to wait for the ZMQ_EVENT_DISCONNECTED event to synchronize this.

I was also thinking of putting a mutex on the SUB side of the PAIR and seeing if there's any difference.

BTW, I seem to remember reading somewhere that connects on inproc sockets are synchronous, but I can't find the reference -- is that true?


Unfortunately, since your program uses pthreads, I cannot simply test it on my Windows machine.

If that's the only hurdle, maybe it would help if I switched to Boost threads? I think that should work on Windows, but I don't have a Windows machine to test with.


  1. Does your program work reliably with numThreads == 1?

Yes, in the sense that it will eventually hang in zmq_send.

*** Iteration  337
threads/pairs -threads 1 -msgs 1000 -quiet -rate 1000 -poll
Using libzmq version 4.2.3
Using pair sockets
Inter-message sleep = 1000 us
Polling after connect
Sleeping for 1 seconds at shutdown

Here's the stack:

[/home/btorpey] pstack 1397
Thread 5 (Thread 0x7f36a73d8700 (LWP 1399)):
#0  0x00000033bfedf383 in poll () from /lib64/libc.so.6
#1  0x00007f36a7a1859a in zmq::socket_poller_t::wait(zmq::socket_poller_t::event_t*, int, long) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#2  0x00007f36a7a160b9 in zmq_poller_wait_all () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#3  0x00007f36a7a168ff in zmq_poller_poll(zmq_pollitem_t*, int, long) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#4  0x00007f36a7a15a4c in zmq_poll () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#5  0x0000000000405091 in mainLoop(void*) ()
#6  0x00000033c0607aa1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000033bfee8bcd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f36a69d7700 (LWP 1400)):
#0  0x00000033bfedf383 in poll () from /lib64/libc.so.6
#1  0x00007f36a79efc2f in zmq::signaler_t::wait(int) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#2  0x00007f36a79c9d5f in zmq::mailbox_t::recv(zmq::command_t*, int) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#3  0x00007f36a79f5496 in zmq::socket_base_t::process_commands(int, bool) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#4  0x00007f36a79f4c8c in zmq::socket_base_t::send(zmq::msg_t*, int) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#5  0x00007f36a7a148b9 in s_sendmsg(zmq::socket_base_t*, zmq_msg_t*, int) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#6  0x00007f36a7a14a00 in zmq_send () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#7  0x0000000000404f35 in sendCommand(void*, zmqControlMsg*, int) ()
#8  0x00000000004044e9 in sendControlMsg(void*, char, void*, long) ()
#9  0x0000000000404564 in commandLoop(void*) ()
#10 0x00000033c0607aa1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000033bfee8bcd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f36a5fd6700 (LWP 1401)):
#0  0x00000033bfee91c3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f36a79c309e in zmq::epoll_t::loop() () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#2  0x00007f36a79c3352 in zmq::epoll_t::worker_routine(void*) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#3  0x00007f36a7a08c7a in thread_routine () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#4  0x00000033c0607aa1 in start_thread () from /lib64/libpthread.so.0
#5  0x00000033bfee8bcd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f36a55d5700 (LWP 1402)):
#0  0x00000033bfee91c3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f36a79c309e in zmq::epoll_t::loop() () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#2  0x00007f36a79c3352 in zmq::epoll_t::worker_routine(void*) () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#3  0x00007f36a7a08c7a in thread_routine () from /home/btorpey/install/libzmq/4.2.3/dev/lib64/libzmq.so.5.1.3
#4  0x00000033c0607aa1 in start_thread () from /lib64/libpthread.so.0
#5  0x00000033bfee8bcd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f36a73da720 (LWP 1397)):
#0  0x00000033c06082fd in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000405374 in main ()
[/home/btorpey] 

  2. If it does not, does it work reliably when you add a socket monitor and wait for a ZMQ_EVENT_DISCONNECTED after calling zmq_disconnect?

As above, I'll try this with both a mutex and the socket monitor -- the latter will take a bit more work. I'll post back here when I have something.

@sigiesec
Member

That's interesting - I didn't know that. I should probably get more familiar with the spec, so I checked rfc.zeromq.org/spec:31/EXPAIR, but frankly I don't see where it says that.

The 31/EXPAIR spec says:

MAY be connected to at most one PAIR peers

I am not completely sure what you are trying to do overall, but I think you should somehow use DEALER and ROUTER sockets to achieve your goals. The basic mechanism is described here http://zguide.zeromq.org/page:all#The-Asynchronous-Client-Server-Pattern and there are several more advanced patterns based on that.

I was also thinking of putting a mutex on the SUB side of the PAIR and seeing if there's any difference.

That would synchronize your threads and might solve the issue above, but this is not what I meant. I meant synchronizing your thread with the libzmq internal asynchronous handling.

BTW, I seem to remember reading somewhere that connects on inproc sockets are synchronous, but I can't find the reference -- is that true?

connect -> yes
disconnect/close -> no

If that's the only hurdle, maybe it would help if I switched to Boost threads? I think that should work on Windows, but I don't have a Windows machine to test with.

Hm... std::thread would be better for me testing it locally.

However, if this were to be included in the 0MQ test suite, neither could be used; only zmq_thread* is available there, and it does not provide TLS.

Does your program work reliably with numThreads == 1?

Yes, in the sense that it will eventually hang in zmq_send.

That's not what I meant with "reliably" ;) So basically this means "no".

As above, I'll try this with both a mutex and the socket monitor -- the latter will take a bit more work. I'll post back here when I have something.

Ok, thanks. But please try with a single thread.

@WallStProg
Contributor Author

I am not completely sure what you are trying to do overall, but I think you should somehow use DEALER and ROUTER sockets to achieve your goals.

I ended up using CLIENT/SERVER sockets for my application: they're thread-safe, which is handy, and a bit simpler since they don't require multi-part messages.

That would synchronize your threads and might solve the issue above, but this is not what I meant. I meant synchronizing your thread with the libzmq internal asynchronous handling.

You're right -- wrapping the connect/send/disconnect in a mutex doesn't help at all. (WallStProg/zmqtests@f9a448d#diff-0919fe44fdbbd233e5e2e8587006b7b2)

Ok, thanks. But please try with a single thread.

I think this is already answered -- a single thread doing connect/send/disconnect will eventually deadlock on the zmq_send (see earlier stack trace).

I'll try tapping into zmq_socket_monitor next, but that will be a bit more work -- will report back.

Thanks again!

@stale

stale bot commented May 19, 2019

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.


9 participants