ZMQ_REQ_CORRELATE may not work when sent messages are not received immediately #1695

Closed
FredTreg opened this issue Jan 5, 2016 · 1 comment

Comments

@FredTreg
Contributor

FredTreg commented Jan 5, 2016

Problem

Consider a REQ socket with both the ZMQ_REQ_RELAXED and ZMQ_REQ_CORRELATE options enabled. If two messages are sent in a row and the server that the REQ socket connects to is bound only after both sends, the next recv() on the socket does not return the reply to the second message, as would be expected. The reply to the first message is received instead.

The example code pasted at the end of this issue reproduces this behavior.

Notes:

  1. This issue is linked to issue #1690 ("ZMQ_REQ_RELAXED does not work") and its fix. Without that fix this issue cannot happen, but the ZMQ_REQ_RELAXED option cannot be used either.
  2. The problem described here is only one example of a more general issue with the ZMQ_REQ_CORRELATE option. Incorrect correlation may arise with other setups, such as an inproc:// ROUTER server that is bound before the messages are sent but is slow to respond to them.

Analysis

By checking the message parts sent, I could track the issue down to this piece of code in the req.cpp file:

int zmq::req_t::xsend (msg_t *msg_)
{
    ... 

        if (request_id_frames_enabled) {
            request_id++;

            msg_t id;
            //  --> Next line causes the issue: request_id may change in the future
            //  --> as it is not duplicated prior to being sent to the pipe.
            int rc = id.init_data (&request_id, sizeof (request_id), NULL, NULL);
            errno_assert (rc == 0);
            id.set_flags (msg_t::more);

            rc = dealer_t::sendpipe (&id, &reply_pipe);

    ...

The request_id in the code above is a member variable, and it is not copied to a new variable when passed to the init_data method: the message only keeps a pointer to it. So all messages queued this way share the same variable.

As a consequence, when two send() calls are issued in a row, the first message may not yet have gone out on the wire when the second message is written to the pipe. The request_id value of the first message is then overwritten by that of the second message before it is transmitted. Sent messages no longer have unique ids and can no longer be correlated.
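The aliasing described above can be reproduced without libzmq. The following is a minimal sketch, not the libzmq code: frame_ref_t and fake_req_t are hypothetical stand-ins for a zero-copy message frame and for a REQ socket whose pipe has not been drained yet, and flush() plays the role of the I/O thread writing the queued frames.

```cpp
#include <cstdint>
#include <cstring>
#include <deque>
#include <vector>

//  Hypothetical stand-in for a zero-copy frame: it stores a pointer to the
//  caller's buffer instead of a copy of the bytes.
struct frame_ref_t
{
    const void *data;
    size_t size;
};

//  Hypothetical stand-in for a REQ socket with no peer available yet.
struct fake_req_t
{
    uint32_t request_id = 0;
    std::deque<frame_ref_t> pipe;

    //  Same pattern as in the xsend excerpt: bump the member, then queue a
    //  frame that merely points at it.
    void send ()
    {
        request_id++;
        pipe.push_back (frame_ref_t{&request_id, sizeof (request_id)});
    }

    //  Simulates the queued frames finally being written out: the id bytes
    //  are only read at this point, after every send () has already run.
    std::vector<uint32_t> flush ()
    {
        std::vector<uint32_t> wired;
        for (const frame_ref_t &frame : pipe) {
            uint32_t id;
            std::memcpy (&id, frame.data, frame.size);
            wired.push_back (id);
        }
        pipe.clear ();
        return wired;
    }
};
```

Calling send () twice and then flush () on a fake_req_t yields the id 2 for both frames, which is the loss of uniqueness described above.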

Solutions

  1. A first solution would be to clear the send pipe of any messages not yet on the wire before sending the next message down the pipe. I am unfamiliar with the internals of zmq and could not find an obvious API for this. For example, the terminate() method that was used before the fix for issue #1690 ("ZMQ_REQ_RELAXED does not work") goes too far, as it closes the send pipe permanently.
  2. Another solution would be to create a copy of request_id before sending it to the pipe. Though this solution has a small performance impact, it is easy to understand and works regardless of the (potentially unknown and changing) internals of zmq.

Unless someone comments on these solutions, I will submit a pull request implementing solution 2 in a couple of days.
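Solution 2 can be sketched without libzmq as follows. This is an illustration under assumptions, not the patch itself: owned_frame_t, fake_req_copy_t and free_id are hypothetical names, and free_fn_t only mirrors the (void *data, void *hint) shape of a deallocation callback. Each frame owns a heap copy of the id, so later increments of the member cannot change frames that are already queued.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <deque>
#include <vector>

//  Deallocation callback type, taking the buffer and an unused hint.
typedef void (free_fn_t) (void *data, void *hint);

//  Releases the heap copy of a request id once its frame has been written.
void free_id (void *data, void *hint)
{
    (void) hint;
    std::free (data);
}

//  Hypothetical frame that owns its buffer and knows how to release it.
struct owned_frame_t
{
    void *data;
    size_t size;
    free_fn_t *ffn;
};

//  Hypothetical stand-in for a REQ socket with no peer available yet.
struct fake_req_copy_t
{
    uint32_t request_id = 0;
    std::deque<owned_frame_t> pipe;

    //  Copy the id to the heap before queuing, so each frame keeps the
    //  value it had at send time.
    void send ()
    {
        request_id++;
        uint32_t *copy =
          static_cast<uint32_t *> (std::malloc (sizeof (uint32_t)));
        if (!copy)
            std::abort ();
        *copy = request_id;
        pipe.push_back (owned_frame_t{copy, sizeof (uint32_t), free_id});
    }

    //  Simulates writing the queued frames out, then frees each copy.
    std::vector<uint32_t> flush ()
    {
        std::vector<uint32_t> wired;
        for (const owned_frame_t &frame : pipe) {
            uint32_t id;
            std::memcpy (&id, frame.data, frame.size);
            wired.push_back (id);
            frame.ffn (frame.data, NULL);
        }
        pipe.clear ();
        return wired;
    }
};
```

With this variant, two send () calls followed by flush () yield the ids 1 and 2, at the cost of one small allocation per request.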

Example code

Example code demonstrating the issue (without error handling to make it clearer):

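//  Assumes the helpers from libzmq's tests/testutil.hpp, which provides
//  s_send_seq, s_recv_seq and SEQ_END.
#include "testutil.hpp"

//  Utility function which reads a message and sends it back unchanged.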
void bounce (void *socket)
{
    int more;
    size_t more_size = sizeof (more);
    do {
        zmq_msg_t recv_part;
        zmq_msg_init (&recv_part);

        zmq_msg_recv (&recv_part, socket, 0);
        zmq_getsockopt (socket, ZMQ_RCVMORE, &more, &more_size);

        zmq_msg_t sent_part;
        zmq_msg_init (&sent_part);
        zmq_msg_copy (&sent_part, &recv_part);
        zmq_msg_close (&recv_part);

        zmq_msg_send (&sent_part, socket, more ? ZMQ_SNDMORE : 0);
    } while (more);
}

int main (void)
{
    int enabled = 1;

    //  Setup and connect REQ socket as client.
    void *ctx = zmq_ctx_new ();
    void *req = zmq_socket (ctx, ZMQ_REQ);

    zmq_setsockopt (req, ZMQ_REQ_RELAXED, &enabled, sizeof (int));
    zmq_setsockopt (req, ZMQ_REQ_CORRELATE, &enabled, sizeof (int));

    zmq_connect (req, "tcp://localhost:5555");

    //  Setup ROUTER socket as server but do *not* bind it just yet.
    void *router = zmq_socket (ctx, ZMQ_ROUTER);

    //  Send two requests.
    s_send_seq (req, "A", SEQ_END);
    s_send_seq (req, "B", SEQ_END);

    //  Bind server allowing it to receive messages.
    zmq_bind (router, "tcp://127.0.0.1:5555");

    //  Read the two messages and send them back as is.
    bounce (router);
    bounce (router);

    //  Read the expected correlated reply. As ZMQ_REQ_CORRELATE is active,
    //  "A" should be discarded and "B" should be read.

    s_recv_seq (req, "B", SEQ_END); // <-- This will fail, "A" is read

    zmq_close (req);
    zmq_close (router);
    zmq_ctx_term (ctx);
}
FredTreg added a commit to FredTreg/libzmq that referenced this issue Mar 20, 2016
Problem: when using ZMQ_REQ_RELAXED + ZMQ_REQ_CORRELATE and two 'send' are
executed in a row and no server is available at the time of the sends,
then the internal request_id used to identify messages gets corrupted and
the two messages end up with the same request_id. The correlation no
longer works in that case and you may end up with the wrong message.

Solution: make a copy of the request_id instance member before sending it
down the pipe.
hintjens added a commit that referenced this issue Mar 24, 2016
Fixed issue #1695 (ZMQ_REQ_CORRELATE)
@hitstergtd
Member

@FredTreg,
Thanks for the report. Closing issue since your pull request was merged by @c-rack.
