First subscriptions seem to be dropped when a PUB connects to a SUB #2267
This is an old and known problem, and it is why usually the PUB binds and the SUB connects: https://zeromq.jira.com/browse/LIBZMQ-270

Basically, since connect is asynchronous, there is nothing to send the subscription to yet. The internal commands are processed in the application thread rather than in the I/O thread, so if you look with GDB you will see that the subscription message is actually sent only when the zmq_recv call runs, because only at that point does the SUB socket's context in the application thread see that the PUB is now connected. A workaround usable in real-world applications is to poll, since that also takes care of running the internal state machine/commands, as highlighted in the example linked from that Jira ticket: https://gist.github.com/hintjens/7344533

So in your code, if you poll before and after the zmq_send, it will then work. |
This should be fixed, but as noted in the Jira ticket mentioned above, doing so would be quite complicated. |
@bluca Thank you for explaining. I understand the bug now. I think the documentation for … Is it okay that I don't close this issue until we have actually fixed the bug, even though it duplicates an issue on the Jira? |
Yes, the Jira tracker is deprecated and read-only. |
Okay. Does your team have someone who can fix the issue? |
I do not have a team :-) I don't think anyone is actively looking into this at the moment. |
I'm trying to understand this, and apparently I'm not doing a very good job so far ;-) First, let me see if I can describe the problem accurately:
There are a number of references to this problem, including:
Several of these suggest doing a … Pieter's workaround (https://gist.github.com/hintjens/7344533) has the SUB socket calling zmq_poll after the PUB socket connects to it. That works fine in the simple example provided, but in the real world the SUB and PUB sockets are going to be in separate programs, so:
There is another problem that I think I may be running into, which is that my SUB socket is ALWAYS in a …
(Setting a timeout on the zmq_poll call should work, but I don't know that the timeout can be set low enough to be useful without burning a CPU core.) In an earlier comment in this thread, Luca says:
At this point, the only general solution I can think of is to use …

Or is there an easier way? I've been calling …

Thanks in advance for any advice, suggestions, etc.! |
I've put together an exhaustive survey of inter-thread communication methods here: https://github.com/WallStProg/zmqtests/tree/master/threads. This includes a sample of using PUB/SUB sockets for inter-thread communication that illustrates the underlying problem, as well as problems with the suggested workarounds. The short version is that I've been unable to come up with any workaround that reliably avoids losing initial messages when the PUB connects to the SUB. Polling on the PUB side helps a bit, but not enough to be reliable -- it looks like the poll needs to happen on the SUB side, but at least in the most common scenarios I can come up with, the SUB side is already sitting in a …
If there's a way to resolve this issue that I haven't been able to find, I'd be very grateful to learn it. Thanks in advance! |
Any progress on this issue? |
On my end, I've been running a bunch of tests that I think demonstrate that the "work-around" laid out in https://gist.github.com/hintjens/7344533 is unreliable -- the code will eventually deadlock in the zmq_recv call. (In one test, this happened on iteration #723.) I haven't yet had time to write this up in a way that is rigorous enough to present, but I hope to do so soon. |
It turns out that there's nothing special needed -- simply run the code repeatedly and it will eventually hang on the call to … So far as I can tell, the work-around doesn't … |
Every socket has a file descriptor attached to it for signaling, so adding the SUB socket to the main loop's zmq_poll should allow the SUB to handle all incoming connections. The workaround fails because sometimes the connecting phase takes more than 1 ms (the timeout used for the zmq_poll). A specific fix to the workaround would be to run the zmq_poll in a loop and query the number of connected peers (which is not possible at the moment).
@WallStProg @mesca @sublee will that help? |
Hi Doron: Thanks for the update.
What I'm saying is that the canonical "work-around" for this issue, originally posted by Pieter (calling zmq_poll to trigger process_commands), is not at all reliable. Unfortunately, newbies keep getting pointed to it, but the fact is that it just flat doesn't work -- sometimes (most of the time, in fact), you "get lucky", but that's completely non-deterministic.
OK, finally an explanation that makes sense!
I'm sure that would be very helpful for a lot of people, but in my case it's not -- my network is dynamic, so clients don't know how many peers they are supposed to have. |
Out of curiosity, do push/pull sockets suffer from this issue as well? I'm wondering if a better topology (for my use case) might be:
...with the broker in the middle republishing the messages received from the PULL socket. Does this get around the issue? I know this has the disadvantage of every "published" message going through the broker, but in my case I don't care — that's happening anyhow. |
Hi Jonah:
A few things to think about:
- One potential issue that has nothing to do with 0mq is how you deal with what are sometimes called “late joiners”. Put another way, a stream of updates is like a stream of water — you can only step into it at one point, and everything that went by before that is simply gone. If that doesn’t work for your application, you may be better served with some kind of queueing/store-and-forward solution, which 0mq is not. (You can build a queueing solution with 0mq, but it doesn’t do that out of the box.)
- Having a central broker doesn’t solve the original problem (missing messages that are published right around the time a subscriber connects) — it just moves the problem to the broker. Subscribers still have to connect to the broker, and can still miss messages that are published prior to the *completion* of the connect. (This is because filtering is done on the publisher, not the subscriber, for point-to-point protocols like TCP).
- For a subscriber to know that it’s connected to a broker, take a look at the “welcome message” functionality added by Doron: https://somdoron.com/2015/09/reliable-pubsub/. This lets a client connect to a publisher (which may be a broker) and know for sure once it is connected. Our application uses this feature to make sure that published messages don’t go into the “bit bucket” — the application subscribes to the welcome message, and only starts publishing after receiving it. This of course doesn’t solve the “late joiner” problem, as discussed above.
- If you have a central broker, you also have a single point of failure. Depending on your application, you may need multiple brokers to ensure that a single failure doesn’t bring down your application. Once you do that, you also have to deal with duplicate messages, with attendant increases in bandwidth usage, etc.
- If none of the above works for you, you might want to consider a multicast protocol (i.e., PGM). With multicast, every subscriber receives every message, and filtering is done on the subscriber, not the publisher.
Hope this helps,
Bill
… On Apr 4, 2019, at 10:33 AM, Jonah Petri wrote:

+----+  +----+  +----+
|PUSH|  |PUSH|  |PUSH|
+----+  +----+  +----+
    \      |      /
      +---------+
      |  PULL   |
      |         |
      |  XPUB   |
      +---------+
    /      |      \
+-----+ +-----+ +-----+
| SUB | | SUB | | SUB |
+-----+ +-----+ +-----+
|
Hello Bill!
|
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions. |
subscribe doesn't make it to the pub socket in some cases, causing initial messages to be lost zeromq/libzmq#2267
I usually use PUB sockets as clients of SUB sockets. In this topology, the first subscriptions from the SUB sockets always seem to be dropped.

@kexplo, my co-worker, wrote code in C++ to reproduce this issue. There is a switch constant named PUB_CONNECTS. If it is true, a PUB socket connects to a SUB socket, and the SUB socket won't receive the message the PUB socket sent. Otherwise, a SUB socket connects to a PUB socket and everything is fine.

I tested with ZeroMQ 4.2.0.