
Multipart messages not fully atomic on push/pull sockets #1244

Closed
takluyver opened this issue Dec 7, 2018 · 10 comments

Comments

@takluyver
Contributor

I think we've (mostly @tmichela) stumbled across a case where multipart messages are not delivered atomically. I've tried to distil a minimal reproducible example in this gist.

  1. Start atomicity_issue_push.py running in one terminal.
  2. In another terminal, run atomicity_issue_pull.py repeatedly.
  3. It should always receive 2 parts, as it does the first time it's run, but subsequent runs often (not always) receive 1 part.

I think it's likely something to do with the coincidence of sending large messages (0.5 GB) and exiting the process soon after receiving the message - if I insert time.sleep(1) at the end of the pull script, I can't reproduce it. The process should probably be cleaning up the context properly, but consider this as a simulated crash: it shouldn't affect things for other processes.
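The gist itself isn't reproduced above, but a condensed single-process sketch of the same multipart pattern looks roughly like this (the endpoint name is hypothetical, and a small body stands in for the ~0.5 GB one; the real repro needs two separate processes, with the receiver exiting immediately):

```python
import zmq

ctx = zmq.Context()

push = ctx.socket(zmq.PUSH)
push.bind("inproc://atomicity-demo")

pull = ctx.socket(zmq.PULL)
pull.connect("inproc://atomicity-demo")

# One logical message sent as two frames: a small header and a large body.
push.send(b"header", zmq.SNDMORE)
push.send(b"x" * 1024)  # stand-in for the ~512 MB body

parts = pull.recv_multipart()
print(len(parts))  # the atomicity guarantee says this should always be 2

ctx.destroy()
```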

The zmq_send docs say:

ØMQ ensures atomic delivery of messages: peers shall receive either all message parts of a message or none at all.

Perhaps this is a problem in libzmq, but I'm much more confident investigating it in Python than in C, so I thought I'd bring it up here first. It's possible it's related to zeromq/libzmq#1588, which we ran into first; only after upgrading to get the fix for that did we start seeing this issue.

@takluyver
Contributor Author

Checking the length of the message parts confirms that when only one part is received, it is the second part: the complete 512 MB frame arrives.

@minrk
Member

minrk commented Dec 9, 2018

I think you're absolutely right that this is a libzmq issue. @bluca any idea how a large multipart message and a previous receiver could result in partial delivery of future messages, with the first part omitted?

I can reproduce this with pyzmq 17.1.2 and libzmq 4.2.5 on macOS.

From looking at the fix for the issue you mentioned, rollback is only called when delivery fails for a message part other than the last, so if it's the last frame that fails, the rollback won't be triggered. I think perhaps the fix there is to call rollback if it's a multipart message (as the comment describes), rather than `if (more)`, which only checks whether it's a non-final part of a multipart message. I'm not sure whether that's the issue or not, but at the least the comment and code don't seem to match up.

@bluca
Member

bluca commented Dec 9, 2018

Interestingly, I can reproduce this only with a per-message data buffer, not with a shared one.

https://gist.github.com/bluca/6def6f11d65fea2017e842d20cec7d80

@bluca
Member

bluca commented Dec 9, 2018

Ah, it's a timing issue: allocating the large buffer between sending the first part and the second adds enough delay. Calling zmq_msg_init_size before the first send makes the issue disappear.
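In pyzmq terms (the C gist calls zmq_msg_init_size directly), the two orderings described above would look roughly like this; the endpoint name and sizes are illustrative, not from the gist:

```python
import zmq

BODY_SIZE = 1024  # stand-in for the ~512 MB body in the real repro

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.bind("inproc://timing-demo")
pull = ctx.socket(zmq.PULL)
pull.connect("inproc://timing-demo")

# Ordering that exposed the bug: the large buffer is allocated *between*
# the two sends, delaying the final frame while the first is already queued.
push.send(b"header", zmq.SNDMORE)
body = bytes(BODY_SIZE)  # expensive allocation happens mid-message
push.send(body)

# Ordering that hid the bug: allocate everything up front (the
# zmq_msg_init_size equivalent), so both frames go to libzmq back to back.
body = bytes(BODY_SIZE)
push.send(b"header", zmq.SNDMORE)
push.send(body)

first = pull.recv_multipart()
second = pull.recv_multipart()
ctx.destroy()
```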

@bluca
Member

bluca commented Dec 9, 2018

The problem is that the linked fix (which does the rollback) also sets _more to false, which means that when the pipe is terminated it never enters dropping mode. Removing that fixes the issue, but I'm not sure it's the right thing to do; I'll try to have another look later this week.

@bluca
Member

bluca commented Dec 28, 2018

I have a solution which appears correct, at least in the sense that it doesn't change the current behaviour and it solves the problem at hand.

@bluca
Member

bluca commented Dec 28, 2018

As a workaround, you can use the equivalent of ZMQ_DONTWAIT; the send should then fail and bail out instead of re-sending the last part when a new socket connects.
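In pyzmq, ZMQ_DONTWAIT is exposed as zmq.NOBLOCK (also zmq.DONTWAIT). A minimal sketch of the workaround, using a deliberately unconnected PUSH socket so the non-blocking send fails immediately (the port choice is arbitrary):

```python
import zmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.bind_to_random_port("tcp://127.0.0.1")  # no PULL peer is connected

sent = False
try:
    # With NOBLOCK (ZMQ_DONTWAIT), send raises zmq.Again instead of
    # blocking, so the sender bails out rather than re-sending the last
    # part when a new socket later connects.
    push.send(b"header", zmq.SNDMORE | zmq.NOBLOCK)
    push.send(b"body", zmq.NOBLOCK)
    sent = True
except zmq.Again:
    print("no peer ready; message not sent")

ctx.destroy(linger=0)
```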

@bluca
Member

bluca commented Dec 30, 2018

This should be fixed by zeromq/libzmq#3343; please try again with the latest libzmq master.

@minrk
Member

minrk commented Jan 3, 2019

Thanks @bluca!

@takluyver
Contributor Author

I just tried with pyzmq 18.0.1, and I can no longer reproduce it. It appears that the fix is in libzmq 4.3.1, which is bundled by pyzmq 18.0. So I'll close this. Thanks for dealing with it!
