REQ/REP message ordering loses messages #176

Open
kschiess opened this issue Jun 23, 2017 · 4 comments

@kschiess

kschiess commented Jun 23, 2017

I am trying to send multiple messages through req/rep. This works for simple examples where I just send a single message - but as soon as there is more than one req, I only see some of the reps.

I expect this test to succeed:

var nano = require('../');
var test = require('tape');

test('reqrep sending n messages', function (t) {
    t.plan(2);

    var req = nano.socket('req');
    var rep = nano.socket('rep');

    var addr = 'inproc://reqrep';

    rep.bind(addr);
    req.connect(addr);

    var repMsgs = [];   // requests seen by the rep socket
    var reqMsgs = [];   // replies seen by the req socket

    rep.on('data', function (buf) {
        repMsgs.push(String(buf));
        rep.send(repMsgs.length);   // reply with a running count: '1', '2', '3'
    });
    req.on('data', function (buf) {
        reqMsgs.push(String(buf));
    });

    for (let el of ['1', '2', '3']) {
        req.send(el);
    }

    setTimeout(function () {
        t.equal(repMsgs, ['1', '2', '3']);
        t.equal(reqMsgs, ['1', '2', '3']);
    }, 100);
});

Typical output looks like this:

TAP version 13
# reqrep sending n messages
not ok 1 should be equal
  ---
    operator: equal
    expected: [ '1', '2', '3' ]
    actual:   [ '1', '3' ]
    at: Timeout._onTimeout (/Users/kschiess/git/3rd/nodelibs/node-nanomsg/test/reqrep.js:31:9)
  ...
not ok 2 should be equal
  ---
    operator: equal
    expected: [ '1', '2', '3' ]
    actual:   [ '2' ]
    at: Timeout._onTimeout (/Users/kschiess/git/3rd/nodelibs/node-nanomsg/test/reqrep.js:32:9)
  ...

In a nutshell, the 'rep' server only sees messages 1 and 3 and answers both, but since '3' has already been read before the reply to '1' is sent, only that last answer is received by the req client.

I am guessing what happens is that recv is called for more than one msg at a time, which makes nanomsg expect answers for only the last recv. The only way of fixing this that I see is to change the interface of the 'data' event to include some origin information (nanomsg control headers) and to be able to then pair up answers to control headers through something like rep.answer(control, data).
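
For illustration, a rough sketch of how that proposed interface might look from user code. rep.answer, the extra control argument and handleRequest are hypothetical here; none of them exist in the current binding:

// Hypothetical interface: the binding passes the nanomsg control header along
// with the payload, and the reply call hands it back so libnanomsg can route
// the answer to the matching request.
rep.on('data', function (buf, control) {
    var reply = handleRequest(String(buf));   // application-level handling (hypothetical helper)
    rep.answer(control, reply);               // pair this reply with its originating request
});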

@reqshark
Collaborator

what operating system? is it the same for inproc as tcp?

btw I think inproc is slightly busted on this version of libnanomsg; it's definitely worse on some systems than others, so we ought to update our libnanomsg version (haven't done that in a while)

there are two things I would do to improve this test:

  1. don't send in the same event loop tick as calling connect() and bind(). Those calls might still be busy in kernel land, internally moving the socket through listen --> bind --> accept and so on, so you can expect some latency there. Use a short timeout or setImmediate before trying to send, while the sockets might still be caught in a half-connected state. This matters less for inproc or ipc and much more for tcp, for obvious reasons, but ACKs aside it's still good practice to wait after calling connect and bind.

  2. I wouldn't even try sending a req until you recv its rep. Sending in a fast loop of req.sends before rep can begin processing them doesn't work too well: Node streams aren't truly backpressured, so your last req.send call (expecting 3) has already hit the kernel's TX buffer before the first RX buf has been recognized by rep. I blame libnanomsg partially, but equally or more the Node streams implementation. You might try setting the highWaterMark to 1 on your stream and make sure it's in object mode. See the sketch below for both suggestions combined.
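
A minimal sketch of both suggestions, using the same socket calls as the test above. The inproc address, the 'ack:' reply format and the use of setImmediate are illustrative choices only, and the module is assumed here to be requireable as 'nanomsg':

var nano = require('nanomsg');

var req = nano.socket('req');
var rep = nano.socket('rep');
var addr = 'inproc://reqrep-lockstep';

rep.bind(addr);
req.connect(addr);

// rep answers every request it actually receives
rep.on('data', function (buf) {
    rep.send('ack:' + String(buf));
});

var pending = ['1', '2', '3'];

function sendNext() {
    if (pending.length) req.send(pending.shift());
}

// lock-step: only send the next req after its rep has arrived
req.on('data', function (buf) {
    console.log('rep said', String(buf));
    sendNext();
});

// give connect()/bind() at least a tick (longer for tcp) before the first send
setImmediate(sendNext);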

still curious what system you're reporting from and what happens in that arrangement w/ TCP?

@kschiess
Author

This happens on all modes of transport.

Re 1: The first message reaches the rep server successfully, so it seems to me that the sockets are set up. I was expecting to maybe lose the first message due to setup, but once messages get through, nothing should be lost. What happens here (losing message 2) is a bug.

Re 2: My expectation from the exposed interface (the 'data' event) is that for every req send I'll get a rep 'data' event. Usually the two sockets will not live in the same process as they do here; my setup has two processes and the results are similar. Are you saying that I should wait for 'rep' to receive my send? How can that be done in a two-process setup?

Usually, when using nanomsg in blocking mode, you do a send and then a recv. nanomsg tracks which send you did last and will only give you the response to that request. I guess what happens here is that the sends happen in sequence, and because of the async nature of the node/binding code no recvs are issued between them. So nanomsg will not even hand the binding code answers that may arrive on the network, since another request has been issued in the meantime!

This also means that if we go and add delays to the test, we're cheating ourselves out of the problem.

I see a solution using sendmsg and recvmsg, and passing the control header around. This is the only thing I see that can work with node's thread model. That would mean that the binding code passes 'buffer' and 'control' into the 'data' event, and the event handler itself would then call sendmsg, passing control back in. Otherwise I fail to see how nanomsg can even work in this context. It's not so much a bug of the underlying lib; it's that we don't provide the context for proper operation, IMHO.

Does the above make sense to you?

(System is OS X and Linux)

@reqshark
Collaborator

thanks for elaborating on a couple points i skipped in your first message, namely:

I am guessing what happens is that recv is called for more than one msg at a time

based on your test output, to me it looks like recv/REP is only getting called twice, because after the first req.send, the rep has not replied yet; any and all subsequent premature calls to req.send are bound to be dropped or ignored.

one way to fix that would be to check for a library error code. If one was set by the C library we could check for it and emit a JavaScript stream error; if such an error was set, it would likely be associated with send() at that point. Another way to detect this type of program error, if none was set by the library, would be to compare the size of the buffer passed in from JavaScript land against the return value of nn_send, which returns the number of bytes sent over the protocol. If there is a gross size mismatch or something like zero (msg got dropped), we'll have enough information to emit a JavaScript error... or alternatively, if we don't want to bother the user so much with stream errors in JavaScript, we could back off and retry calls to send()
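
A rough sketch of that size-check idea at the JavaScript level. The names nn.Send, nn.Errno and checkedSend are hypothetical stand-ins for whatever the low-level binding actually exposes; the only thing carried over from the C API is that nn_send returns the byte count on success and -1 on error:

// Hypothetical low-level wrapper: nn.Send is assumed to surface nn_send's
// return value (byte count on success, -1 on error), nn.Errno its errno.
function checkedSend(socketId, buf, emitter) {
    var sent = nn.Send(socketId, buf);
    if (sent < 0) {
        emitter.emit('error', new Error('nn_send failed, errno ' + nn.Errno()));
    } else if (sent !== buf.length) {
        // gross size mismatch (or zero bytes): the msg was likely dropped
        emitter.emit('error', new Error('short send: ' + sent + ' of ' + buf.length + ' bytes'));
    }
    return sent;
}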

you're suggesting something else:

change the interface of the 'data' event to include some origin information (nanomsg control headers) and to be able to then pair up answers to control headers through something like rep.answer(control, data).

this sounds interesting because we still haven't exposed the control headers from libnanomsg and it would be a nice thing to expose.

Although I am unconvinced your test case establishes a msg ordering problem in REQ/REP, I'm totally open to adding options to req/rep or more likely to other socket types by way of control msg headers, and think that's a great idea!

now answering more points from your last reply:

Are you saying that I should wait for 'rep' to receive my send?

Yes. 100% you don't have a choice here except to do so.

How can that be done in a two process setup?

Wait for the msg from REP!

Only then, can you call req.send again.

Not before hearing back from REP!

This also means that if we go and add delays to the test, we're cheating ourselves out of the problem.

fair enough, forget i said anything about delays in the test, timeouts are dirty escape hatches :)

The rest of your comments are based on an expectation of behavior that was not written into the REQ/REP protocol as we use it now. We only expose the simple nn_send/nn_recv calls, and there is still no support for nn_sendmsg and nn_recvmsg!

hint hint: please send us a PR exposing those calls to JavaScript :)

Besides, that ^^ would still be step 1 in your proposal to modify REQ/REP behavior, which is not a bad idea, despite my small disagreement with the expected behavior you want from REQ/REP in its current form.

@reqshark reqshark added question and removed bug labels Jun 23, 2017
@kschiess
Author

Fair enough; I see your position and I feel you see mine. I'll try to come up with some actual code to support my position, then we can discuss the details.
