Multipart messages not fully atomic on push/pull sockets #1244
Comments
Checking the length of the message parts confirms that when the receiver gets only one part, it is the second part, and that part is received in full (512 MB).
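A check like the one described above could look roughly like this. It is only a sketch: the helper names and the expected part count of two are assumptions for illustration, not code from the gist. The socket argument is anything with a recv_multipart() method that returns a list of frames, as a pyzmq socket does.

```python
def summarize_parts(parts, expected_parts=2):
    """Return (is_complete, lengths) for one received multipart message."""
    lengths = [len(p) for p in parts]
    return len(parts) == expected_parts, lengths


def recv_and_check(sock, expected_parts=2):
    """Receive one multipart message from sock and report whether it is whole."""
    # For a pyzmq socket, recv_multipart() returns a list of bytes frames.
    parts = sock.recv_multipart()
    complete, lengths = summarize_parts(parts, expected_parts)
    if not complete:
        print(f"partial message: got {len(parts)} part(s) with lengths {lengths}")
    return complete, lengths
```

Calling recv_and_check(pull_socket) in the pull script would print the part lengths whenever a message arrives with the wrong number of parts, which is enough to tell which part went missing.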
I think you're absolutely right that this is a libzmq issue. @bluca any idea how a large multipart message and a previous receiver could result in partial delivery of future messages, omitting the first part of messages? I can reproduce this with pyzmq 17.1.2 and libzmq 4.2.5 on macOS. From looking at the fix for the issue you mentioned, rollback is only called when delivery fails for a message part other than the last, so if it's the last frame that fails, it won't trigger the rollback. I think perhaps the fix there is to call rollback if it's a multipart message (as the comment describes), rather than
Interestingly, I can reproduce only with a per-message data buffer, not with a shared one. https://gist.github.com/bluca/6def6f11d65fea2017e842d20cec7d80
Ah, it's a timing issue: allocating the large buffer between sending the first part and the second adds enough delay. Calling zmq_msg_init_size before the first send makes the issue disappear.
The problem is that the linked fix (the one that does the rollback) also sets _more to false, which means that when the pipe is terminated it will never get into dropping mode. Removing that fixes the issue, but I'm not sure if it's the right thing to do. I'll try to have a look again later this week.
I have a solution which appears correct, at least in the sense that it doesn't change the current behaviour and it solves the problem at hand.
As a workaround, you can use the equivalent of ZMQ_DONTWAIT and it should fail and bail out instead of re-sending the last part when the new socket connects.
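In pyzmq, the equivalent of ZMQ_DONTWAIT is the zmq.DONTWAIT flag, and a send that would block raises zmq.Again. A minimal sketch of the workaround follows. The helper name and the inproc endpoint are made up for illustration, and the example only shows the "fail and bail out" behaviour on a PUSH socket with no connected peer; it does not reproduce the original bug.

```python
import zmq


def try_send_multipart(sock, parts):
    """Send all parts without blocking; return False if the send would block."""
    try:
        # pyzmq applies the flag to every part and raises zmq.Again on EAGAIN.
        sock.send_multipart(parts, flags=zmq.DONTWAIT)
        return True
    except zmq.Again:
        return False


ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.setsockopt(zmq.LINGER, 0)  # do not wait for unsent data on close
push.bind("inproc://dontwait-sketch")  # hypothetical endpoint, no peer connected

sent = try_send_multipart(push, [b"header", b"payload"])
print("sent" if sent else "no peer ready, message not sent")

push.close()
ctx.term()
```

The caller then decides whether to retry, drop the message, or report an error, rather than leaving a blocked send to complete against whichever socket connects next.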
Should be fixed by zeromq/libzmq#3343. Please try again with the latest libzmq master.
Thanks @bluca!
I just tried with pyzmq 18.0.1, and I can no longer reproduce it. It appears that the fix is in libzmq 4.3.1, which is bundled by pyzmq 18.0. So I'll close this. Thanks for dealing with it!
I think we've (mostly @tmichela) stumbled across a case where multipart messages are not delivered atomically. I've tried to distil a minimum reproducible example in this gist.
Start atomicity_issue_push.py running in one terminal.
Run atomicity_issue_pull.py repeatedly.
I think it's likely something to do with the coincidence of sending large messages (0.5 GB) and exiting the process soon after receiving the message: if I insert time.sleep(1) at the end of the pull script, I can't reproduce it. The process should probably be cleaning up the context properly, but consider this as a simulated crash: it shouldn't affect things for other processes.
The zmq_send docs say:
Perhaps this is a problem in libzmq, but I'm much more confident investigating it in Python than in C, so I thought I'd bring it up here first. It's possible it's related to zeromq/libzmq#1588, which we ran into first; only after upgrading to get the fix for that did we start seeing this issue.